* [PATCH v2 00/14] TCP transport binding for NVMe over Fabrics
From: Sagi Grimberg @ 2018-11-20  3:00 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, netdev, David S. Miller, Keith Busch, Christoph Hellwig

Changes from v1:
- unified skb_copy_datagram_iter and skb_copy_and_csum_datagram (and the
  new skb_copy_and_hash_datagram_iter) into a single code path
- removed nvmet modparam budgets (made them a define set to their default
  values)
- fixed nvme-tcp host chained r2t transfers (issue reported off-list)
- made the .install_queue callout return an nvme status code
- added some review tags
- rebased on top of nvme-4.21 branch (nvme tree) + sqflow disable patches

This patch set implements the NVMe over Fabrics TCP host and target
drivers. Now NVMe over Fabrics can run on every Ethernet port in the world.
The implementation conforms to the NVMe over Fabrics 1.1 specification (which
will include the already publicly available NVMe/TCP transport binding, TP 8000).

The host driver hooks into the NVMe host stack and implements the TCP
transport binding for NVMe over Fabrics. The NVMe over Fabrics TCP host
driver is responsible for establishing an NVMe/TCP connection, TCP event
and error handling, and data-plane messaging and stream processing.

The target driver hooks into the NVMe target core stack and implements
the TCP transport binding. The NVMe over Fabrics target driver is
responsible for accepting and establishing NVMe/TCP connections, TCP
event and error handling, and data-plane messaging and stream processing.

The implementations of both the host and the target are fairly simple and
straightforward. Every NVMe queue is backed by a TCP socket that provides
reliable, in-order delivery of fabrics capsules and/or data.

All NVMe queues are sharded over a private bound workqueue, so a single
context always handles a given queue's byte stream and no additional
locking or serialization is needed. In addition, close attention was paid
to keeping the data plane completely non-blocking, to minimize context
switching and unforced scheduling.
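
A rough, illustrative sketch of this per-queue model (the names below are
hypothetical and not necessarily the ones used in the patches): each queue
owns one socket and one work item, and the socket callbacks merely kick
that work item on the queue's CPU, so a single context consumes the stream:

	struct example_tcp_queue {
		struct socket		*sock;		/* backing TCP socket */
		struct work_struct	io_work;	/* single data-plane context */
		int			io_cpu;		/* CPU this queue is bound to */
	};

	static struct workqueue_struct *example_tcp_wq;

	static void example_tcp_data_ready(struct sock *sk)
	{
		struct example_tcp_queue *queue = sk->sk_user_data;

		/* never process the stream here; schedule the per-queue
		 * context, which serializes all stream handling
		 */
		queue_work_on(queue->io_cpu, example_tcp_wq, &queue->io_work);
	}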

I piggybacked the nvme-cli patches onto the set for completeness.

Also, the netdev mailing list is cc'd as this patch set contains generic
helpers for online digest calculation (patches 1-3).

The patchset structure:
- patches 1-6 are preparation, adding a helper for online digest calculation
  with data placement
- patches 7-11 are preparatory patches for NVMe/TCP
- patches 12-14 implement NVMe/TCP
- patches 15-17 are nvme-cli additions for NVMe/TCP

Thanks to the members of the Fabrics Linux Driver team who helped with the
development, testing and benchmarking of this work.

Gitweb code is available at:

	git://git.infradead.org/nvme.git nvme-tcp

Sagi Grimberg (14):
  ath6kl: add ath6kl_ prefix to crypto_type
  datagram: open-code copy_page_to_iter
  iov_iter: pass void csum pointer to csum_and_copy_to_iter
  net/datagram: consolidate datagram copy to iter helpers
  iov_iter: introduce hash_and_copy_to_iter helper
  datagram: introduce skb_copy_and_hash_datagram_iter helper
  nvme-core: add work elements to struct nvme_ctrl
  nvmet: Add install_queue callout
  nvmet: allow configfs tcp trtype configuration
  nvme-fabrics: allow user passing header digest
  nvme-fabrics: allow user passing data digest
  nvme-tcp: Add protocol header
  nvmet-tcp: add NVMe over TCP target driver
  nvme-tcp: add NVMe over TCP host driver

 drivers/net/wireless/ath/ath6kl/cfg80211.c |    2 +-
 drivers/net/wireless/ath/ath6kl/common.h   |    2 +-
 drivers/net/wireless/ath/ath6kl/wmi.c      |    6 +-
 drivers/net/wireless/ath/ath6kl/wmi.h      |    6 +-
 drivers/nvme/host/Kconfig                  |   15 +
 drivers/nvme/host/Makefile                 |    3 +
 drivers/nvme/host/fabrics.c                |   10 +
 drivers/nvme/host/fabrics.h                |    4 +
 drivers/nvme/host/fc.c                     |   18 +-
 drivers/nvme/host/nvme.h                   |    2 +
 drivers/nvme/host/rdma.c                   |   19 +-
 drivers/nvme/host/tcp.c                    | 2306 ++++++++++++++++++++
 drivers/nvme/target/Kconfig                |   10 +
 drivers/nvme/target/Makefile               |    2 +
 drivers/nvme/target/configfs.c             |    1 +
 drivers/nvme/target/fabrics-cmd.c          |    9 +
 drivers/nvme/target/nvmet.h                |    1 +
 drivers/nvme/target/tcp.c                  | 1741 +++++++++++++++
 include/linux/nvme-tcp.h                   |  189 ++
 include/linux/nvme.h                       |    1 +
 include/linux/skbuff.h                     |    3 +
 include/linux/uio.h                        |    5 +-
 lib/iov_iter.c                             |   19 +-
 net/core/datagram.c                        |  158 +-
 24 files changed, 4406 insertions(+), 126 deletions(-)
 create mode 100644 drivers/nvme/host/tcp.c
 create mode 100644 drivers/nvme/target/tcp.c
 create mode 100644 include/linux/nvme-tcp.h

-- 
2.17.1


* [PATCH v2 01/14] ath6kl: add ath6kl_ prefix to crypto_type
From: Sagi Grimberg @ 2018-11-20  3:00 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, netdev, David S. Miller, Keith Busch, Christoph Hellwig

From: Sagi Grimberg <sagi@lightbitslabs.com>

Prevent a namespace conflict; in the following patches skbuff.h will
include the crypto API.

Cc: Kalle Valo <kvalo@codeaurora.org>
Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
---
 drivers/net/wireless/ath/ath6kl/cfg80211.c | 2 +-
 drivers/net/wireless/ath/ath6kl/common.h   | 2 +-
 drivers/net/wireless/ath/ath6kl/wmi.c      | 6 +++---
 drivers/net/wireless/ath/ath6kl/wmi.h      | 6 +++---
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/wireless/ath/ath6kl/cfg80211.c b/drivers/net/wireless/ath/ath6kl/cfg80211.c
index e121187f371f..fa049c4ae315 100644
--- a/drivers/net/wireless/ath/ath6kl/cfg80211.c
+++ b/drivers/net/wireless/ath/ath6kl/cfg80211.c
@@ -1322,7 +1322,7 @@ static int ath6kl_cfg80211_set_default_key(struct wiphy *wiphy,
 	struct ath6kl_vif *vif = netdev_priv(ndev);
 	struct ath6kl_key *key = NULL;
 	u8 key_usage;
-	enum crypto_type key_type = NONE_CRYPT;
+	enum ath6kl_crypto_type key_type = NONE_CRYPT;
 
 	ath6kl_dbg(ATH6KL_DBG_WLAN_CFG, "%s: index %d\n", __func__, key_index);
 
diff --git a/drivers/net/wireless/ath/ath6kl/common.h b/drivers/net/wireless/ath/ath6kl/common.h
index 4f82e8632d37..d6e5234f67a1 100644
--- a/drivers/net/wireless/ath/ath6kl/common.h
+++ b/drivers/net/wireless/ath/ath6kl/common.h
@@ -67,7 +67,7 @@ struct ath6kl_llc_snap_hdr {
 	__be16 eth_type;
 } __packed;
 
-enum crypto_type {
+enum ath6kl_crypto_type {
 	NONE_CRYPT          = 0x01,
 	WEP_CRYPT           = 0x02,
 	TKIP_CRYPT          = 0x04,
diff --git a/drivers/net/wireless/ath/ath6kl/wmi.c b/drivers/net/wireless/ath/ath6kl/wmi.c
index 777acc564ac9..9d7ac1ab2d02 100644
--- a/drivers/net/wireless/ath/ath6kl/wmi.c
+++ b/drivers/net/wireless/ath/ath6kl/wmi.c
@@ -1849,9 +1849,9 @@ int ath6kl_wmi_connect_cmd(struct wmi *wmi, u8 if_idx,
 			   enum network_type nw_type,
 			   enum dot11_auth_mode dot11_auth_mode,
 			   enum auth_mode auth_mode,
-			   enum crypto_type pairwise_crypto,
+			   enum ath6kl_crypto_type pairwise_crypto,
 			   u8 pairwise_crypto_len,
-			   enum crypto_type group_crypto,
+			   enum ath6kl_crypto_type group_crypto,
 			   u8 group_crypto_len, int ssid_len, u8 *ssid,
 			   u8 *bssid, u16 channel, u32 ctrl_flags,
 			   u8 nw_subtype)
@@ -2301,7 +2301,7 @@ int ath6kl_wmi_disctimeout_cmd(struct wmi *wmi, u8 if_idx, u8 timeout)
 }
 
 int ath6kl_wmi_addkey_cmd(struct wmi *wmi, u8 if_idx, u8 key_index,
-			  enum crypto_type key_type,
+			  enum ath6kl_crypto_type key_type,
 			  u8 key_usage, u8 key_len,
 			  u8 *key_rsc, unsigned int key_rsc_len,
 			  u8 *key_material,
diff --git a/drivers/net/wireless/ath/ath6kl/wmi.h b/drivers/net/wireless/ath/ath6kl/wmi.h
index a60bb49fe920..784940ba4c90 100644
--- a/drivers/net/wireless/ath/ath6kl/wmi.h
+++ b/drivers/net/wireless/ath/ath6kl/wmi.h
@@ -2556,9 +2556,9 @@ int ath6kl_wmi_connect_cmd(struct wmi *wmi, u8 if_idx,
 			   enum network_type nw_type,
 			   enum dot11_auth_mode dot11_auth_mode,
 			   enum auth_mode auth_mode,
-			   enum crypto_type pairwise_crypto,
+			   enum ath6kl_crypto_type pairwise_crypto,
 			   u8 pairwise_crypto_len,
-			   enum crypto_type group_crypto,
+			   enum ath6kl_crypto_type group_crypto,
 			   u8 group_crypto_len, int ssid_len, u8 *ssid,
 			   u8 *bssid, u16 channel, u32 ctrl_flags,
 			   u8 nw_subtype);
@@ -2610,7 +2610,7 @@ int ath6kl_wmi_config_debug_module_cmd(struct wmi *wmi, u32 valid, u32 config);
 
 int ath6kl_wmi_get_stats_cmd(struct wmi *wmi, u8 if_idx);
 int ath6kl_wmi_addkey_cmd(struct wmi *wmi, u8 if_idx, u8 key_index,
-			  enum crypto_type key_type,
+			  enum ath6kl_crypto_type key_type,
 			  u8 key_usage, u8 key_len,
 			  u8 *key_rsc, unsigned int key_rsc_len,
 			  u8 *key_material,
-- 
2.17.1


* [PATCH v2 02/14] datagram: open-code copy_page_to_iter
From: Sagi Grimberg @ 2018-11-20  3:00 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, netdev, David S. Miller, Keith Busch, Christoph Hellwig

This will be useful to consolidate skb_copy_and_hash_datagram_iter and
skb_copy_and_csum_datagram to a single code path.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 net/core/datagram.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 57f3a6fcfc1e..abe642181b64 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -445,11 +445,14 @@ int skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
 
 		end = start + skb_frag_size(frag);
 		if ((copy = end - offset) > 0) {
+			struct page *page = skb_frag_page(frag);
+			u8 *vaddr = kmap(page);
+
 			if (copy > len)
 				copy = len;
-			n = copy_page_to_iter(skb_frag_page(frag),
-					      frag->page_offset + offset -
-					      start, copy, to);
+			n = copy_to_iter(vaddr + frag->page_offset +
+					 offset - start, copy, to);
+			kunmap(page);
 			offset += n;
 			if (n != copy)
 				goto short_copy;
-- 
2.17.1


* [PATCH v2 03/14] iov_iter: pass void csum pointer to csum_and_copy_to_iter
From: Sagi Grimberg @ 2018-11-20  3:00 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, netdev, David S. Miller, Keith Busch, Christoph Hellwig

The only caller of csum_and_copy_to_iter is skb_copy_and_csum_datagram,
and we are trying to unify its logic with skb_copy_datagram_iter by passing
the copy operation we want to apply as a callback. Thus, we need to make
the checksum pointer private to the function, passed through an opaque
void pointer (see the sketch below).
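
To illustrate where the series is heading (a sketch only, not code from
this patch): with a void * "private data" slot, the checksum copy fits the
same callback prototype as a plain copy:

	typedef size_t (*copy_cb_t)(const void *addr, size_t bytes,
				    void *data, struct iov_iter *i);

	/* data carries the __wsum * for the csum variant, NULL otherwise */
	copy_cb_t cb = csum_and_copy_to_iter;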

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 include/linux/uio.h | 2 +-
 lib/iov_iter.c      | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 55ce99ddb912..41d1f8d3313d 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -266,7 +266,7 @@ static inline void iov_iter_reexpand(struct iov_iter *i, size_t count)
 {
 	i->count = count;
 }
-size_t csum_and_copy_to_iter(const void *addr, size_t bytes, __wsum *csum, struct iov_iter *i);
+size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *csump, struct iov_iter *i);
 size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i);
 bool csum_and_copy_from_iter_full(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i);
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 7ebccb5c1637..db93531ca3e3 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1432,10 +1432,11 @@ bool csum_and_copy_from_iter_full(void *addr, size_t bytes, __wsum *csum,
 }
 EXPORT_SYMBOL(csum_and_copy_from_iter_full);
 
-size_t csum_and_copy_to_iter(const void *addr, size_t bytes, __wsum *csum,
+size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *csump,
 			     struct iov_iter *i)
 {
 	const char *from = addr;
+	__wsum *csum = csump;
 	__wsum sum, next;
 	size_t off = 0;
 	sum = *csum;
-- 
2.17.1


* [PATCH v2 04/14] net/datagram: consolidate datagram copy to iter helpers
From: Sagi Grimberg @ 2018-11-20  3:00 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, netdev, David S. Miller, Keith Busch, Christoph Hellwig

skb_copy_datagram_iter and skb_copy_and_csum_datagram are essentially
the same, with a couple of differences: the first is the copy operation
used, which is either a simple copy or a csum_and_copy, and the second
is the behavior on the "short copy" path, where a simple copy needs to
return the number of bytes successfully copied while csum_and_copy needs
to fault immediately as the checksum is partial.

Introduce __skb_datagram_iter that additionally accepts:
1. a copy operation function pointer
2. private data that goes with the copy operation
3. a fault_short flag to indicate the action on a short copy
A hypothetical additional consumer is sketched below.
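
For illustration only (not part of this patch), a new variant would just
supply its own callback and private data; the names below are hypothetical:

	static size_t crc_and_copy_to_iter(const void *addr, size_t bytes,
					   void *data, struct iov_iter *i)
	{
		u32 *crc = data;
		size_t copied = copy_to_iter(addr, bytes, i);

		*crc = crc32_le(*crc, addr, copied);
		return copied;
	}

	static int skb_copy_and_crc_datagram_iter(const struct sk_buff *skb,
						  int offset, struct iov_iter *to,
						  int len, u32 *crc)
	{
		/* fault_short == true: a short copy leaves *crc partial */
		return __skb_datagram_iter(skb, offset, to, len, true,
					   crc_and_copy_to_iter, crc);
	}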

Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 net/core/datagram.c | 136 ++++++++++++++------------------------------
 1 file changed, 42 insertions(+), 94 deletions(-)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index abe642181b64..382543302ae5 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -408,27 +408,20 @@ int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags)
 }
 EXPORT_SYMBOL(skb_kill_datagram);
 
-/**
- *	skb_copy_datagram_iter - Copy a datagram to an iovec iterator.
- *	@skb: buffer to copy
- *	@offset: offset in the buffer to start copying from
- *	@to: iovec iterator to copy to
- *	@len: amount of data to copy from buffer to iovec
- */
-int skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
-			   struct iov_iter *to, int len)
+int __skb_datagram_iter(const struct sk_buff *skb, int offset,
+			struct iov_iter *to, int len, bool fault_short,
+			size_t (*cb)(const void *, size_t, void *, struct iov_iter *),
+			void *data)
 {
 	int start = skb_headlen(skb);
 	int i, copy = start - offset, start_off = offset, n;
 	struct sk_buff *frag_iter;
 
-	trace_skb_copy_datagram_iovec(skb, len);
-
 	/* Copy header. */
 	if (copy > 0) {
 		if (copy > len)
 			copy = len;
-		n = copy_to_iter(skb->data + offset, copy, to);
+		n = cb(skb->data + offset, copy, data, to);
 		offset += n;
 		if (n != copy)
 			goto short_copy;
@@ -450,8 +443,8 @@ int skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
 
 			if (copy > len)
 				copy = len;
-			n = copy_to_iter(vaddr + frag->page_offset +
-					 offset - start, copy, to);
+			n = cb(vaddr + frag->page_offset +
+				offset - start, copy, data, to);
 			kunmap(page);
 			offset += n;
 			if (n != copy)
@@ -471,8 +464,8 @@ int skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
 		if ((copy = end - offset) > 0) {
 			if (copy > len)
 				copy = len;
-			if (skb_copy_datagram_iter(frag_iter, offset - start,
-						   to, copy))
+			if (__skb_datagram_iter(frag_iter, offset - start,
 +						to, copy, fault_short, cb, data))
 				goto fault;
 			if ((len -= copy) == 0)
 				return 0;
@@ -493,11 +486,32 @@ int skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
 	return -EFAULT;
 
 short_copy:
-	if (iov_iter_count(to))
+	if (fault_short || iov_iter_count(to))
 		goto fault;
 
 	return 0;
 }
+
+static size_t simple_copy_to_iter(const void *addr, size_t bytes,
+		void *data __always_unused, struct iov_iter *i)
+{
+	return copy_to_iter(addr, bytes, i);
+}
+
+/**
+ *	skb_copy_datagram_iter - Copy a datagram to an iovec iterator.
+ *	@skb: buffer to copy
+ *	@offset: offset in the buffer to start copying from
+ *	@to: iovec iterator to copy to
+ *	@len: amount of data to copy from buffer to iovec
+ */
+int skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
+			   struct iov_iter *to, int len)
+{
+	trace_skb_copy_datagram_iovec(skb, len);
+	return __skb_datagram_iter(skb, offset, to, len, false,
+			simple_copy_to_iter, NULL);
+}
 EXPORT_SYMBOL(skb_copy_datagram_iter);
 
 /**
@@ -648,87 +662,21 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
 }
 EXPORT_SYMBOL(zerocopy_sg_from_iter);
 
+/**
+ *	skb_copy_and_csum_datagram_iter - Copy datagram to an iovec iterator
+ *          and update a checksum.
+ *	@skb: buffer to copy
+ *	@offset: offset in the buffer to start copying from
+ *	@to: iovec iterator to copy to
+ *	@len: amount of data to copy from buffer to iovec
+ *      @csump: checksum pointer
+ */
 static int skb_copy_and_csum_datagram(const struct sk_buff *skb, int offset,
 				      struct iov_iter *to, int len,
 				      __wsum *csump)
 {
-	int start = skb_headlen(skb);
-	int i, copy = start - offset, start_off = offset;
-	struct sk_buff *frag_iter;
-	int pos = 0;
-	int n;
-
-	/* Copy header. */
-	if (copy > 0) {
-		if (copy > len)
-			copy = len;
-		n = csum_and_copy_to_iter(skb->data + offset, copy, csump, to);
-		offset += n;
-		if (n != copy)
-			goto fault;
-		if ((len -= copy) == 0)
-			return 0;
-		pos = copy;
-	}
-
-	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
-		int end;
-		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-
-		WARN_ON(start > offset + len);
-
-		end = start + skb_frag_size(frag);
-		if ((copy = end - offset) > 0) {
-			__wsum csum2 = 0;
-			struct page *page = skb_frag_page(frag);
-			u8  *vaddr = kmap(page);
-
-			if (copy > len)
-				copy = len;
-			n = csum_and_copy_to_iter(vaddr + frag->page_offset +
-						  offset - start, copy,
-						  &csum2, to);
-			kunmap(page);
-			offset += n;
-			if (n != copy)
-				goto fault;
-			*csump = csum_block_add(*csump, csum2, pos);
-			if (!(len -= copy))
-				return 0;
-			pos += copy;
-		}
-		start = end;
-	}
-
-	skb_walk_frags(skb, frag_iter) {
-		int end;
-
-		WARN_ON(start > offset + len);
-
-		end = start + frag_iter->len;
-		if ((copy = end - offset) > 0) {
-			__wsum csum2 = 0;
-			if (copy > len)
-				copy = len;
-			if (skb_copy_and_csum_datagram(frag_iter,
-						       offset - start,
-						       to, copy,
-						       &csum2))
-				goto fault;
-			*csump = csum_block_add(*csump, csum2, pos);
-			if ((len -= copy) == 0)
-				return 0;
-			offset += copy;
-			pos += copy;
-		}
-		start = end;
-	}
-	if (!len)
-		return 0;
-
-fault:
-	iov_iter_revert(to, offset - start_off);
-	return -EFAULT;
+	return __skb_datagram_iter(skb, offset, to, len, true,
+			csum_and_copy_to_iter, csump);
 }
 
 __sum16 __skb_checksum_complete_head(struct sk_buff *skb, int len)
-- 
2.17.1


* [PATCH v2 05/14] iov_iter: introduce hash_and_copy_to_iter helper
From: Sagi Grimberg @ 2018-11-20  3:00 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, netdev, David S. Miller, Keith Busch, Christoph Hellwig

From: Sagi Grimberg <sagi@lightbitslabs.com>

Allow consumers to use the iov iterator helpers and also update a
predefined hash calculation online while copying data. This is useful
when copying incoming network buffers to a local iterator and calculating
a digest of the incoming stream. The nvme-tcp host driver that will be
introduced in the following patches is the first consumer, via
skb_copy_and_hash_datagram_iter.
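
A rough usage sketch (not code from this patch; error handling omitted,
and buf/buf_len/iter are assumed to exist): hash a stream with a crc32c
ahash transform while copying it out to an iterator:

	struct crypto_ahash *tfm = crypto_alloc_ahash("crc32c", 0, 0);
	struct ahash_request *req = ahash_request_alloc(tfm, GFP_KERNEL);
	u8 digest[4];

	ahash_request_set_callback(req, 0, NULL, NULL);
	crypto_ahash_init(req);

	/* copy (and hash) each chunk as it is consumed */
	hash_and_copy_to_iter(buf, buf_len, req, &iter);

	/* ... further chunks ... */

	ahash_request_set_crypt(req, NULL, digest, 0);
	crypto_ahash_final(req);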

Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
---
 include/linux/uio.h |  3 +++
 lib/iov_iter.c      | 16 ++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 41d1f8d3313d..ecf584f6b82d 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -11,6 +11,7 @@
 
 #include <linux/kernel.h>
 #include <linux/thread_info.h>
+#include <crypto/hash.h>
 #include <uapi/linux/uio.h>
 
 struct page;
@@ -269,6 +270,8 @@ static inline void iov_iter_reexpand(struct iov_iter *i, size_t count)
 size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *csump, struct iov_iter *i);
 size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i);
 bool csum_and_copy_from_iter_full(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i);
+size_t hash_and_copy_to_iter(const void *addr, size_t bytes, void *hashp,
+		struct iov_iter *i);
 
 int import_iovec(int type, const struct iovec __user * uvector,
 		 unsigned nr_segs, unsigned fast_segs,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index db93531ca3e3..8a5f7b2ae346 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -6,6 +6,7 @@
 #include <linux/vmalloc.h>
 #include <linux/splice.h>
 #include <net/checksum.h>
+#include <linux/scatterlist.h>
 
 #define PIPE_PARANOIA /* for now */
 
@@ -1475,6 +1476,21 @@ size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *csump,
 }
 EXPORT_SYMBOL(csum_and_copy_to_iter);
 
+size_t hash_and_copy_to_iter(const void *addr, size_t bytes, void *hashp,
+		struct iov_iter *i)
+{
+	struct ahash_request *hash = hashp;
+	struct scatterlist sg;
+	size_t copied;
+
+	copied = copy_to_iter(addr, bytes, i);
+	sg_init_one(&sg, addr, copied);
+	ahash_request_set_crypt(hash, &sg, NULL, copied);
+	crypto_ahash_update(hash);
+	return copied;
+}
+EXPORT_SYMBOL(hash_and_copy_to_iter);
+
 int iov_iter_npages(const struct iov_iter *i, int maxpages)
 {
 	size_t size = i->count;
-- 
2.17.1


* [PATCH v2 06/14] datagram: introduce skb_copy_and_hash_datagram_iter helper
From: Sagi Grimberg @ 2018-11-20  3:00 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, netdev, David S. Miller, Keith Busch, Christoph Hellwig

Introduce a helper that copies a datagram into an iovec iterator while
also updating a predefined hash. This is useful for consumers of
skb_copy_datagram_iter that want to support an in-flight data digest
without having to finish the copy first and only then traverse the iovec
to calculate the digest hash.
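
For illustration only (the example_* type and fields below are hypothetical,
not taken from this patch), a receive path that already keeps a per-stream
ahash_request can simply switch helpers when its data digest is enabled:

#include <linux/skbuff.h>
#include <linux/uio.h>
#include <crypto/hash.h>

struct example_queue {			/* hypothetical consumer state */
	struct iov_iter		iter;		/* destination of the copy */
	struct ahash_request	*rcv_hash;	/* running data digest (e.g. crc32c) */
	bool			data_digest;
};

static int example_recv_data(struct example_queue *queue,
			     struct sk_buff *skb, int offset, int len)
{
	if (queue->data_digest)
		return skb_copy_and_hash_datagram_iter(skb, offset,
				&queue->iter, len, queue->rcv_hash);
	return skb_copy_datagram_iter(skb, offset, &queue->iter, len);
}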

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 include/linux/skbuff.h |  3 +++
 net/core/datagram.c    | 17 +++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0ba687454267..b0b8d5653f0d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3309,6 +3309,9 @@ static inline int skb_copy_datagram_msg(const struct sk_buff *from, int offset,
 }
 int skb_copy_and_csum_datagram_msg(struct sk_buff *skb, int hlen,
 				   struct msghdr *msg);
+int skb_copy_and_hash_datagram_iter(const struct sk_buff *skb, int offset,
+			   struct iov_iter *to, int len,
+			   struct ahash_request *hash);
 int skb_copy_datagram_from_iter(struct sk_buff *skb, int offset,
 				 struct iov_iter *from, int len);
 int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm);
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 382543302ae5..e6a4fc845f72 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -492,6 +492,23 @@ int __skb_datagram_iter(const struct sk_buff *skb, int offset,
 	return 0;
 }
 
+/**
+ *	skb_copy_and_hash_datagram_iter - Copy datagram to an iovec iterator
+ *          and update a hash.
+ *	@skb: buffer to copy
+ *	@offset: offset in the buffer to start copying from
+ *	@to: iovec iterator to copy to
+ *	@len: amount of data to copy from buffer to iovec
+ *      @hash: hash request to update
+ */
+int skb_copy_and_hash_datagram_iter(const struct sk_buff *skb, int offset,
+			   struct iov_iter *to, int len,
+			   struct ahash_request *hash)
+{
+	return __skb_datagram_iter(skb, offset, to, len, true,
+			hash_and_copy_to_iter, hash);
+}
+
 static size_t simple_copy_to_iter(const void *addr, size_t bytes,
 		void *data __always_unused, struct iov_iter *i)
 {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread


* [PATCH v2 07/14] nvme-core: add work elements to struct nvme_ctrl
  2018-11-20  3:00 ` Sagi Grimberg
@ 2018-11-20  3:00   ` Sagi Grimberg
  -1 siblings, 0 replies; 76+ messages in thread
From: Sagi Grimberg @ 2018-11-20  3:00 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, netdev, David S. Miller, Keith Busch, Christoph Hellwig

From: Sagi Grimberg <sagi@lightbitslabs.com>

connect_work and err_work will be reused by nvme-tcp, so move them into
nvme_ctrl where rdma, fc and the upcoming tcp transport can share them.
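
As a sketch of the intended use (the example_* names are placeholders, not
part of this patch), a new transport only touches the generic fields, both
at setup and at teardown:

#include <linux/workqueue.h>
#include "nvme.h"		/* struct nvme_ctrl */

/* initialization in the transport's create path */
static void example_init_ctrl_work(struct nvme_ctrl *ctrl,
				   work_func_t err_fn, work_func_t connect_fn)
{
	INIT_WORK(&ctrl->err_work, err_fn);
	INIT_DELAYED_WORK(&ctrl->connect_work, connect_fn);
}

/* teardown: make sure neither work item runs once the ctrl goes away */
static void example_stop_ctrl_work(struct nvme_ctrl *ctrl)
{
	cancel_work_sync(&ctrl->err_work);
	cancel_delayed_work_sync(&ctrl->connect_work);
}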

Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
---
 drivers/nvme/host/fc.c   | 18 ++++++++----------
 drivers/nvme/host/nvme.h |  2 ++
 drivers/nvme/host/rdma.c | 19 ++++++++-----------
 3 files changed, 18 insertions(+), 21 deletions(-)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 83131e42b336..16812e427e17 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -159,8 +159,6 @@ struct nvme_fc_ctrl {
 	struct blk_mq_tag_set	admin_tag_set;
 	struct blk_mq_tag_set	tag_set;
 
-	struct delayed_work	connect_work;
-
 	struct kref		ref;
 	u32			flags;
 	u32			iocnt;
@@ -547,7 +545,7 @@ nvme_fc_resume_controller(struct nvme_fc_ctrl *ctrl)
 			"NVME-FC{%d}: connectivity re-established. "
 			"Attempting reconnect\n", ctrl->cnum);
 
-		queue_delayed_work(nvme_wq, &ctrl->connect_work, 0);
+		queue_delayed_work(nvme_wq, &ctrl->ctrl.connect_work, 0);
 		break;
 
 	case NVME_CTRL_RESETTING:
@@ -2815,7 +2813,7 @@ nvme_fc_delete_ctrl(struct nvme_ctrl *nctrl)
 {
 	struct nvme_fc_ctrl *ctrl = to_fc_ctrl(nctrl);
 
-	cancel_delayed_work_sync(&ctrl->connect_work);
+	cancel_delayed_work_sync(&ctrl->ctrl.connect_work);
 	/*
 	 * kill the association on the link side.  this will block
 	 * waiting for io to terminate
@@ -2850,7 +2848,7 @@ nvme_fc_reconnect_or_delete(struct nvme_fc_ctrl *ctrl, int status)
 		else if (time_after(jiffies + recon_delay, rport->dev_loss_end))
 			recon_delay = rport->dev_loss_end - jiffies;
 
-		queue_delayed_work(nvme_wq, &ctrl->connect_work, recon_delay);
+		queue_delayed_work(nvme_wq, &ctrl->ctrl.connect_work, recon_delay);
 	} else {
 		if (portptr->port_state == FC_OBJSTATE_ONLINE)
 			dev_warn(ctrl->ctrl.device,
@@ -2918,7 +2916,7 @@ nvme_fc_connect_ctrl_work(struct work_struct *work)
 
 	struct nvme_fc_ctrl *ctrl =
 			container_of(to_delayed_work(work),
-				struct nvme_fc_ctrl, connect_work);
+				struct nvme_fc_ctrl, ctrl.connect_work);
 
 	ret = nvme_fc_create_association(ctrl);
 	if (ret)
@@ -3015,7 +3013,7 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
 	kref_init(&ctrl->ref);
 
 	INIT_WORK(&ctrl->ctrl.reset_work, nvme_fc_reset_ctrl_work);
-	INIT_DELAYED_WORK(&ctrl->connect_work, nvme_fc_connect_ctrl_work);
+	INIT_DELAYED_WORK(&ctrl->ctrl.connect_work, nvme_fc_connect_ctrl_work);
 	spin_lock_init(&ctrl->lock);
 
 	/* io queue count */
@@ -3086,7 +3084,7 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
 
 	nvme_get_ctrl(&ctrl->ctrl);
 
-	if (!queue_delayed_work(nvme_wq, &ctrl->connect_work, 0)) {
+	if (!queue_delayed_work(nvme_wq, &ctrl->ctrl.connect_work, 0)) {
 		nvme_put_ctrl(&ctrl->ctrl);
 		dev_err(ctrl->ctrl.device,
 			"NVME-FC{%d}: failed to schedule initial connect\n",
@@ -3094,7 +3092,7 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
 		goto fail_ctrl;
 	}
 
-	flush_delayed_work(&ctrl->connect_work);
+	flush_delayed_work(&ctrl->ctrl.connect_work);
 
 	dev_info(ctrl->ctrl.device,
 		"NVME-FC{%d}: new ctrl: NQN \"%s\"\n",
@@ -3105,7 +3103,7 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
 fail_ctrl:
 	nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_DELETING);
 	cancel_work_sync(&ctrl->ctrl.reset_work);
-	cancel_delayed_work_sync(&ctrl->connect_work);
+	cancel_delayed_work_sync(&ctrl->ctrl.connect_work);
 
 	ctrl->ctrl.opts = NULL;
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 27663ce3044e..031195e5d7d3 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -240,6 +240,8 @@ struct nvme_ctrl {
 	u16 maxcmd;
 	int nr_reconnects;
 	struct nvmf_ctrl_options *opts;
+	struct delayed_work connect_work;
+	struct work_struct err_work;
 };
 
 struct nvme_subsystem {
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 4468d672ced9..779c2c043242 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -101,12 +101,9 @@ struct nvme_rdma_ctrl {
 
 	/* other member variables */
 	struct blk_mq_tag_set	tag_set;
-	struct work_struct	err_work;
 
 	struct nvme_rdma_qe	async_event_sqe;
 
-	struct delayed_work	reconnect_work;
-
 	struct list_head	list;
 
 	struct blk_mq_tag_set	admin_tag_set;
@@ -910,8 +907,8 @@ static void nvme_rdma_stop_ctrl(struct nvme_ctrl *nctrl)
 {
 	struct nvme_rdma_ctrl *ctrl = to_rdma_ctrl(nctrl);
 
-	cancel_work_sync(&ctrl->err_work);
-	cancel_delayed_work_sync(&ctrl->reconnect_work);
+	cancel_work_sync(&ctrl->ctrl.err_work);
+	cancel_delayed_work_sync(&ctrl->ctrl.connect_work);
 }
 
 static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl)
@@ -943,7 +940,7 @@ static void nvme_rdma_reconnect_or_remove(struct nvme_rdma_ctrl *ctrl)
 	if (nvmf_should_reconnect(&ctrl->ctrl)) {
 		dev_info(ctrl->ctrl.device, "Reconnecting in %d seconds...\n",
 			ctrl->ctrl.opts->reconnect_delay);
-		queue_delayed_work(nvme_wq, &ctrl->reconnect_work,
+		queue_delayed_work(nvme_wq, &ctrl->ctrl.connect_work,
 				ctrl->ctrl.opts->reconnect_delay * HZ);
 	} else {
 		nvme_delete_ctrl(&ctrl->ctrl);
@@ -1015,7 +1012,7 @@ static int nvme_rdma_setup_ctrl(struct nvme_rdma_ctrl *ctrl, bool new)
 static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
 {
 	struct nvme_rdma_ctrl *ctrl = container_of(to_delayed_work(work),
-			struct nvme_rdma_ctrl, reconnect_work);
+			struct nvme_rdma_ctrl, ctrl.connect_work);
 
 	++ctrl->ctrl.nr_reconnects;
 
@@ -1038,7 +1035,7 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
 static void nvme_rdma_error_recovery_work(struct work_struct *work)
 {
 	struct nvme_rdma_ctrl *ctrl = container_of(work,
-			struct nvme_rdma_ctrl, err_work);
+			struct nvme_rdma_ctrl, ctrl.err_work);
 
 	nvme_stop_keep_alive(&ctrl->ctrl);
 	nvme_rdma_teardown_io_queues(ctrl, false);
@@ -1059,7 +1056,7 @@ static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
 	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
 		return;
 
-	queue_work(nvme_wq, &ctrl->err_work);
+	queue_work(nvme_wq, &ctrl->ctrl.err_work);
 }
 
 static void nvme_rdma_wr_error(struct ib_cq *cq, struct ib_wc *wc,
@@ -1932,9 +1929,9 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
 		goto out_free_ctrl;
 	}
 
-	INIT_DELAYED_WORK(&ctrl->reconnect_work,
+	INIT_DELAYED_WORK(&ctrl->ctrl.connect_work,
 			nvme_rdma_reconnect_ctrl_work);
-	INIT_WORK(&ctrl->err_work, nvme_rdma_error_recovery_work);
+	INIT_WORK(&ctrl->ctrl.err_work, nvme_rdma_error_recovery_work);
 	INIT_WORK(&ctrl->ctrl.reset_work, nvme_rdma_reset_ctrl_work);
 
 	ctrl->ctrl.queue_count = opts->nr_io_queues + 1; /* +1 for admin queue */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread


* [PATCH v2 08/14] nvmet: Add install_queue callout
  2018-11-20  3:00 ` Sagi Grimberg
@ 2018-11-20  3:00   ` Sagi Grimberg
  -1 siblings, 0 replies; 76+ messages in thread
From: Sagi Grimberg @ 2018-11-20  3:00 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, netdev, David S. Miller, Keith Busch, Christoph Hellwig

From: Sagi Grimberg <sagi@lightbitslabs.com>

nvmet-tcp will implement it to allocate its queue commands, whose number
(the sq size) is only known at nvmf connect time.
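
As a rough sketch of what an implementation looks like (the example_*
structures below are hypothetical; the real nvmet-tcp version follows
later in this series):

#include <linux/slab.h>
#include "nvmet.h"

struct example_cmd {			/* hypothetical per-command context */
	struct nvmet_req	req;
};

struct example_queue {			/* hypothetical per-queue state */
	struct nvmet_sq		nvme_sq;
	struct example_cmd	*cmds;
};

static u16 example_install_queue(struct nvmet_sq *sq)
{
	struct example_queue *queue =
		container_of(sq, struct example_queue, nvme_sq);

	/* sq->size is only known here, at Fabrics Connect time */
	queue->cmds = kcalloc(sq->size, sizeof(*queue->cmds), GFP_KERNEL);
	if (!queue->cmds)
		return NVME_SC_INTERNAL;
	return 0;
}

The transport then points .install_queue of its nvmet_fabrics_ops at such
a function.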

Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
---
 drivers/nvme/target/fabrics-cmd.c | 9 +++++++++
 drivers/nvme/target/nvmet.h       | 1 +
 2 files changed, 10 insertions(+)

diff --git a/drivers/nvme/target/fabrics-cmd.c b/drivers/nvme/target/fabrics-cmd.c
index 328ae46d8344..86ebf191b035 100644
--- a/drivers/nvme/target/fabrics-cmd.c
+++ b/drivers/nvme/target/fabrics-cmd.c
@@ -121,6 +121,15 @@ static u16 nvmet_install_queue(struct nvmet_ctrl *ctrl, struct nvmet_req *req)
 		req->rsp->sq_head = cpu_to_le16(0xffff);
 	}
 
+	if (ctrl->ops->install_queue) {
+		u16 ret = ctrl->ops->install_queue(req->sq);
+		if (ret) {
+			pr_err("failed to install queue %d cntlid %d ret %x\n",
+				qid, ctrl->cntlid, ret);
+			return ret;
+		}
+	}
+
 	return 0;
 }
 
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 547108c41ce9..957eb7edb902 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -279,6 +279,7 @@ struct nvmet_fabrics_ops {
 	void (*delete_ctrl)(struct nvmet_ctrl *ctrl);
 	void (*disc_traddr)(struct nvmet_req *req,
 			struct nvmet_port *port, char *traddr);
+	u16 (*install_queue)(struct nvmet_sq *nvme_sq);
 };
 
 #define NVMET_MAX_INLINE_BIOVEC	8
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread


* [PATCH v2 09/14] nvmet: allow configfs tcp trtype configuration
  2018-11-20  3:00 ` Sagi Grimberg
@ 2018-11-20  3:00   ` Sagi Grimberg
  -1 siblings, 0 replies; 76+ messages in thread
From: Sagi Grimberg @ 2018-11-20  3:00 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, netdev, David S. Miller, Keith Busch, Christoph Hellwig

From: Sagi Grimberg <sagi@lightbitslabs.com>

Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
---
 drivers/nvme/target/configfs.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index db2cb64be7ba..618bbd006544 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -34,6 +34,7 @@ static const struct nvmet_transport_name {
 } nvmet_transport_names[] = {
 	{ NVMF_TRTYPE_RDMA,	"rdma" },
 	{ NVMF_TRTYPE_FC,	"fc" },
+	{ NVMF_TRTYPE_TCP,	"tcp" },
 	{ NVMF_TRTYPE_LOOP,	"loop" },
 };
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread


* [PATCH v2 10/14] nvme-fabrics: allow user passing header digest
  2018-11-20  3:00 ` Sagi Grimberg
@ 2018-11-20  3:00   ` Sagi Grimberg
  -1 siblings, 0 replies; 76+ messages in thread
From: Sagi Grimberg @ 2018-11-20  3:00 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, netdev, David S. Miller, Keith Busch, Christoph Hellwig

From: Sagi Grimberg <sagi@lightbitslabs.com>

Header digest is an nvme-tcp specific feature, but nothing prevents other
transports from reusing the concept, so do not associate it solely with
the tcp transport.
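
For instance (illustrative only; NVME_TCP_HDR_DIGEST_ENABLE comes from the
protocol header added later in this series, and the helper name is made
up), the nvme-tcp host driver could map the option straight onto the ICReq
digest bit when it sets up a connection:

#include <linux/nvme-tcp.h>
#include "fabrics.h"

static void example_set_icreq_digest(struct nvme_tcp_icreq_pdu *icreq,
				     struct nvmf_ctrl_options *opts)
{
	icreq->digest = 0;
	if (opts->hdr_digest)
		icreq->digest |= NVME_TCP_HDR_DIGEST_ENABLE;
}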

Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
---
 drivers/nvme/host/fabrics.c | 5 +++++
 drivers/nvme/host/fabrics.h | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 10074ac7731b..4272f8a95db3 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -614,6 +614,7 @@ static const match_table_t opt_tokens = {
 	{ NVMF_OPT_HOST_ID,		"hostid=%s"		},
 	{ NVMF_OPT_DUP_CONNECT,		"duplicate_connect"	},
 	{ NVMF_OPT_DISABLE_SQFLOW,	"disable_sqflow"	},
+	{ NVMF_OPT_HDR_DIGEST,		"hdr_digest"		},
 	{ NVMF_OPT_ERR,			NULL			}
 };
 
@@ -633,6 +634,7 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 	opts->reconnect_delay = NVMF_DEF_RECONNECT_DELAY;
 	opts->kato = NVME_DEFAULT_KATO;
 	opts->duplicate_connect = false;
+	opts->hdr_digest = false;
 
 	options = o = kstrdup(buf, GFP_KERNEL);
 	if (!options)
@@ -827,6 +829,9 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 		case NVMF_OPT_DISABLE_SQFLOW:
 			opts->disable_sqflow = true;
 			break;
+		case NVMF_OPT_HDR_DIGEST:
+			opts->hdr_digest = true;
+			break;
 		default:
 			pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n",
 				p);
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index ecd9a006a091..a6127f1a9e8e 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -59,6 +59,7 @@ enum {
 	NVMF_OPT_HOST_ID	= 1 << 12,
 	NVMF_OPT_DUP_CONNECT	= 1 << 13,
 	NVMF_OPT_DISABLE_SQFLOW = 1 << 14,
+	NVMF_OPT_HDR_DIGEST	= 1 << 15,
 };
 
 /**
@@ -103,6 +104,7 @@ struct nvmf_ctrl_options {
 	struct nvmf_host	*host;
 	int			max_reconnects;
 	bool			disable_sqflow;
+	bool			hdr_digest;
 };
 
 /*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread


* [PATCH v2 11/14] nvme-fabrics: allow user passing data digest
  2018-11-20  3:00 ` Sagi Grimberg
@ 2018-11-20  3:00   ` Sagi Grimberg
  -1 siblings, 0 replies; 76+ messages in thread
From: Sagi Grimberg @ 2018-11-20  3:00 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, netdev, David S. Miller, Keith Busch, Christoph Hellwig

From: Sagi Grimberg <sagi@lightbitslabs.com>

Data digest is an nvme-tcp specific feature, but nothing prevents other
transports from reusing the concept, so do not associate it solely with
the tcp transport.

Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
---
 drivers/nvme/host/fabrics.c | 5 +++++
 drivers/nvme/host/fabrics.h | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 4272f8a95db3..9c62c6838b76 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -615,6 +615,7 @@ static const match_table_t opt_tokens = {
 	{ NVMF_OPT_DUP_CONNECT,		"duplicate_connect"	},
 	{ NVMF_OPT_DISABLE_SQFLOW,	"disable_sqflow"	},
 	{ NVMF_OPT_HDR_DIGEST,		"hdr_digest"		},
+	{ NVMF_OPT_DATA_DIGEST,		"data_digest"		},
 	{ NVMF_OPT_ERR,			NULL			}
 };
 
@@ -635,6 +636,7 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 	opts->kato = NVME_DEFAULT_KATO;
 	opts->duplicate_connect = false;
 	opts->hdr_digest = false;
+	opts->data_digest = false;
 
 	options = o = kstrdup(buf, GFP_KERNEL);
 	if (!options)
@@ -832,6 +834,9 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 		case NVMF_OPT_HDR_DIGEST:
 			opts->hdr_digest = true;
 			break;
+		case NVMF_OPT_DATA_DIGEST:
+			opts->data_digest = true;
+			break;
 		default:
 			pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n",
 				p);
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index a6127f1a9e8e..524a02a67817 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -60,6 +60,7 @@ enum {
 	NVMF_OPT_DUP_CONNECT	= 1 << 13,
 	NVMF_OPT_DISABLE_SQFLOW = 1 << 14,
 	NVMF_OPT_HDR_DIGEST	= 1 << 15,
+	NVMF_OPT_DATA_DIGEST	= 1 << 16,
 };
 
 /**
@@ -105,6 +106,7 @@ struct nvmf_ctrl_options {
 	int			max_reconnects;
 	bool			disable_sqflow;
 	bool			hdr_digest;
+	bool			data_digest;
 };
 
 /*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread


* [PATCH v2 12/14] nvme-tcp: Add protocol header
  2018-11-20  3:00 ` Sagi Grimberg
@ 2018-11-20  3:00   ` Sagi Grimberg
  -1 siblings, 0 replies; 76+ messages in thread
From: Sagi Grimberg @ 2018-11-20  3:00 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, netdev, David S. Miller, Keith Busch, Christoph Hellwig

From: Sagi Grimberg <sagi@lightbitslabs.com>

Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
---
 include/linux/nvme-tcp.h | 188 +++++++++++++++++++++++++++++++++++++++
 include/linux/nvme.h     |   1 +
 2 files changed, 189 insertions(+)
 create mode 100644 include/linux/nvme-tcp.h

diff --git a/include/linux/nvme-tcp.h b/include/linux/nvme-tcp.h
new file mode 100644
index 000000000000..33c8afaf63bd
--- /dev/null
+++ b/include/linux/nvme-tcp.h
@@ -0,0 +1,188 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * NVMe over Fabrics TCP protocol header.
+ * Copyright (c) 2018 LightBits Labs. All rights reserved.
+ */
+
+#ifndef _LINUX_NVME_TCP_H
+#define _LINUX_NVME_TCP_H
+
+#include <linux/nvme.h>
+
+#define NVME_TCP_DISC_PORT	8009
+#define NVME_TCP_ADMIN_CCSZ	SZ_8K
+
+enum nvme_tcp_pfv {
+	NVME_TCP_PFV_1_0 = 0x0,
+};
+
+enum nvme_tcp_fatal_error_status {
+	NVME_TCP_FES_INVALID_PDU_HDR		= 0x01,
+	NVME_TCP_FES_PDU_SEQ_ERR		= 0x02,
+	NVME_TCP_FES_HDR_DIGEST_ERR		= 0x03,
+	NVME_TCP_FES_DATA_OUT_OF_RANGE		= 0x04,
+	NVME_TCP_FES_R2T_LIMIT_EXCEEDED		= 0x05,
+	NVME_TCP_FES_DATA_LIMIT_EXCEEDED	= 0x05,
+	NVME_TCP_FES_UNSUPPORTED_PARAM		= 0x06,
+};
+
+enum nvme_tcp_digest_option {
+	NVME_TCP_HDR_DIGEST_ENABLE	= (1 << 0),
+	NVME_TCP_DATA_DIGEST_ENABLE	= (1 << 1),
+};
+
+enum nvme_tcp_pdu_type {
+	nvme_tcp_icreq		= 0x0,
+	nvme_tcp_icresp		= 0x1,
+	nvme_tcp_h2c_term	= 0x2,
+	nvme_tcp_c2h_term	= 0x3,
+	nvme_tcp_cmd		= 0x4,
+	nvme_tcp_rsp		= 0x5,
+	nvme_tcp_h2c_data	= 0x6,
+	nvme_tcp_c2h_data	= 0x7,
+	nvme_tcp_r2t		= 0x9,
+};
+
+enum nvme_tcp_pdu_flags {
+	NVME_TCP_F_HDGST		= (1 << 0),
+	NVME_TCP_F_DDGST		= (1 << 1),
+	NVME_TCP_F_DATA_LAST		= (1 << 2),
+	NVME_TCP_F_DATA_SUCCESS		= (1 << 3),
+};
+
+/**
+ * struct nvme_tcp_hdr - nvme tcp pdu common header
+ *
+ * @type:          pdu type
+ * @flags:         pdu specific flags
+ * @hlen:          pdu header length
+ * @pdo:           pdu data offset
+ * @plen:          pdu wire byte length
+ */
+struct nvme_tcp_hdr {
+	__u8	type;
+	__u8	flags;
+	__u8	hlen;
+	__u8	pdo;
+	__le32	plen;
+};
+
+/**
+ * struct nvme_tcp_icreq_pdu - nvme tcp initialize connection request pdu
+ *
+ * @hdr:           pdu generic header
+ * @pfv:           pdu version format
+ * @hpda:          host pdu data alignment (dwords, 0's based)
+ * @digest:        digest types enabled
+ * @maxr2t:        maximum r2ts per request supported
+ */
+struct nvme_tcp_icreq_pdu {
+	struct nvme_tcp_hdr	hdr;
+	__le16			pfv;
+	__u8			hpda;
+	__u8			digest;
+	__le32			maxr2t;
+	__u8			rsvd2[112];
+};
+
+/**
+ * struct nvme_tcp_icresp_pdu - nvme tcp initialize connection response pdu
+ *
+ * @hdr:           pdu common header
+ * @pfv:           pdu version format
+ * @cpda:          controller pdu data alignment (dwords, 0's based)
+ * @digest:        digest types enabled
+ * @maxdata:       maximum data capsules per r2t supported
+ */
+struct nvme_tcp_icresp_pdu {
+	struct nvme_tcp_hdr	hdr;
+	__le16			pfv;
+	__u8			cpda;
+	__u8			digest;
+	__le32			maxdata;
+	__u8			rsvd[112];
+};
+
+/**
+ * struct nvme_tcp_term_pdu - nvme tcp terminate connection pdu
+ *
+ * @hdr:           pdu common header
+ * @fes:           fatal error status
+ * @fei:           fatal error information
+ */
+struct nvme_tcp_term_pdu {
+	struct nvme_tcp_hdr	hdr;
+	__le16			fes;
+	__le32			fei;
+	__u8			rsvd[8];
+};
+
+/**
+ * struct nvme_tcp_cmd_pdu - nvme tcp command capsule pdu
+ *
+ * @hdr:           pdu common header
+ * @cmd:           nvme command
+ */
+struct nvme_tcp_cmd_pdu {
+	struct nvme_tcp_hdr	hdr;
+	struct nvme_command	cmd;
+};
+
+/**
+ * struct nvme_tcp_rsp_pdu - nvme tcp response capsule pdu
+ *
+ * @hdr:           pdu common header
+ * @hdr:           nvme-tcp generic header
+ * @cqe:           nvme completion queue entry
+ */
+struct nvme_tcp_rsp_pdu {
+	struct nvme_tcp_hdr	hdr;
+	struct nvme_completion	cqe;
+};
+
+/**
+ * struct nvme_tcp_r2t_pdu - nvme tcp ready-to-transfer pdu
+ *
+ * @hdr:           pdu common header
+ * @command_id:    nvme command identifier which this relates to
+ * @ttag:          transfer tag (controller generated)
+ * @r2t_offset:    offset from the start of the command data
+ * @r2t_length:    length the host is allowed to send
+ */
+struct nvme_tcp_r2t_pdu {
+	struct nvme_tcp_hdr	hdr;
+	__u16			command_id;
+	__u16			ttag;
+	__le32			r2t_offset;
+	__le32			r2t_length;
+	__u8			rsvd[4];
+};
+
+/**
+ * struct nvme_tcp_data_pdu - nvme tcp data pdu
+ *
+ * @hdr:           pdu common header
+ * @command_id:    nvme command identifier which this relates to
+ * @ttag:          transfer tag (controller generated)
+ * @data_offset:   offset from the start of the command data
+ * @data_length:   length of the data stream
+ */
+struct nvme_tcp_data_pdu {
+	struct nvme_tcp_hdr	hdr;
+	__u16			command_id;
+	__u16			ttag;
+	__le32			data_offset;
+	__le32			data_length;
+	__u8			rsvd[4];
+};
+
+union nvme_tcp_pdu {
+	struct nvme_tcp_icreq_pdu	icreq;
+	struct nvme_tcp_icresp_pdu	icresp;
+	struct nvme_tcp_cmd_pdu		cmd;
+	struct nvme_tcp_rsp_pdu		rsp;
+	struct nvme_tcp_r2t_pdu		r2t;
+	struct nvme_tcp_data_pdu	data;
+};
+
+#endif /* _LINUX_NVME_TCP_H */
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 2f29b480042b..ad767fa0e902 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -52,6 +52,7 @@ enum {
 enum {
 	NVMF_TRTYPE_RDMA	= 1,	/* RDMA */
 	NVMF_TRTYPE_FC		= 2,	/* Fibre Channel */
+	NVMF_TRTYPE_TCP		= 3,	/* TCP/IP */
 	NVMF_TRTYPE_LOOP	= 254,	/* Reserved for host usage */
 	NVMF_TRTYPE_MAX,
 };
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread


* [PATCH v2 13/14] nvmet-tcp: add NVMe over TCP target driver
  2018-11-20  3:00 ` Sagi Grimberg
@ 2018-11-20  3:00   ` Sagi Grimberg
  -1 siblings, 0 replies; 76+ messages in thread
From: Sagi Grimberg @ 2018-11-20  3:00 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, netdev, David S. Miller, Keith Busch, Christoph Hellwig

From: Sagi Grimberg <sagi@lightbitslabs.com>

This patch implements the TCP transport driver for the NVMe over Fabrics
target stack. This allows exporting NVMe over Fabrics functionality over
good old TCP/IP.

The driver implements TP 8000, which defines how nvme over fabrics
capsules and data are encapsulated in nvme-tcp pdus and exchanged on top
of a TCP byte stream. nvme-tcp header and data digests are supported as
well.
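
As a small illustration of that encapsulation (this helper itself is not
part of the driver; it mirrors what nvmet_setup_c2h_data_pdu() below
computes, and NVME_TCP_DIGEST_LENGTH is the 4-byte digest constant this
patch adds to <linux/nvme-tcp.h>), the wire length of a C2H data PDU is
just the fixed header plus the optional digests around the payload:

#include <linux/nvme-tcp.h>

static u32 example_c2h_data_plen(u32 data_len, bool hdr_digest,
				 bool data_digest)
{
	u32 plen = sizeof(struct nvme_tcp_data_pdu);	/* fixed header (hlen) */

	if (hdr_digest)
		plen += NVME_TCP_DIGEST_LENGTH;	/* digest placed right after hlen */
	plen += data_len;			/* payload starts at offset pdo */
	if (data_digest)
		plen += NVME_TCP_DIGEST_LENGTH;	/* data digest trails the payload */
	return plen;
}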

Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
Signed-off-by: Solganik Alexander <sashas@lightbitslabs.com>
---
 drivers/nvme/target/Kconfig  |   10 +
 drivers/nvme/target/Makefile |    2 +
 drivers/nvme/target/tcp.c    | 1741 ++++++++++++++++++++++++++++++++++
 include/linux/nvme-tcp.h     |    1 +
 4 files changed, 1754 insertions(+)
 create mode 100644 drivers/nvme/target/tcp.c

diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig
index 3c7b61ddb0d1..d94f25cde019 100644
--- a/drivers/nvme/target/Kconfig
+++ b/drivers/nvme/target/Kconfig
@@ -60,3 +60,13 @@ config NVME_TARGET_FCLOOP
 	  to test NVMe-FC transport interfaces.
 
 	  If unsure, say N.
+
+config NVME_TARGET_TCP
+	tristate "NVMe over Fabrics TCP target support"
+	depends on INET
+	depends on NVME_TARGET
+	help
+	  This enables the NVMe TCP target support, which allows exporting NVMe
+	  devices over TCP.
+
+	  If unsure, say N.
diff --git a/drivers/nvme/target/Makefile b/drivers/nvme/target/Makefile
index 8118c93391c6..8c3ad0fb6860 100644
--- a/drivers/nvme/target/Makefile
+++ b/drivers/nvme/target/Makefile
@@ -5,6 +5,7 @@ obj-$(CONFIG_NVME_TARGET_LOOP)		+= nvme-loop.o
 obj-$(CONFIG_NVME_TARGET_RDMA)		+= nvmet-rdma.o
 obj-$(CONFIG_NVME_TARGET_FC)		+= nvmet-fc.o
 obj-$(CONFIG_NVME_TARGET_FCLOOP)	+= nvme-fcloop.o
+obj-$(CONFIG_NVME_TARGET_TCP)		+= nvmet-tcp.o
 
 nvmet-y		+= core.o configfs.o admin-cmd.o fabrics-cmd.o \
 			discovery.o io-cmd-file.o io-cmd-bdev.o
@@ -12,3 +13,4 @@ nvme-loop-y	+= loop.o
 nvmet-rdma-y	+= rdma.o
 nvmet-fc-y	+= fc.o
 nvme-fcloop-y	+= fcloop.o
+nvmet-tcp-y	+= tcp.o
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
new file mode 100644
index 000000000000..61311e518072
--- /dev/null
+++ b/drivers/nvme/target/tcp.c
@@ -0,0 +1,1741 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * NVMe over Fabrics TCP target.
+ * Copyright (c) 2018 LightBits Labs. All rights reserved.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/nvme-tcp.h>
+#include <net/sock.h>
+#include <net/tcp.h>
+#include <linux/inet.h>
+#include <linux/llist.h>
+#include <crypto/hash.h>
+
+#include "nvmet.h"
+
+#define NVMET_TCP_DEF_INLINE_DATA_SIZE	(4 * PAGE_SIZE)
+
+#define NVMET_TCP_RECV_BUDGET		8
+#define NVMET_TCP_SEND_BUDGET		8
+#define NVMET_TCP_IO_WORK_BUDGET	64
+
+enum nvmet_tcp_send_state {
+	NVMET_TCP_SEND_DATA_PDU = 0,
+	NVMET_TCP_SEND_DATA,
+	NVMET_TCP_SEND_R2T,
+	NVMET_TCP_SEND_DDGST,
+	NVMET_TCP_SEND_RESPONSE
+};
+
+enum nvmet_tcp_recv_state {
+	NVMET_TCP_RECV_PDU,
+	NVMET_TCP_RECV_DATA,
+	NVMET_TCP_RECV_DDGST,
+	NVMET_TCP_RECV_ERR,
+};
+
+struct nvmet_tcp_send_ctx {
+	u32			offset;
+	struct scatterlist	*cur_sg;
+	enum nvmet_tcp_send_state state;
+};
+
+enum {
+	NVMET_TCP_F_INIT_FAILED = (1 << 0),
+};
+
+struct nvmet_tcp_cmd {
+	struct nvmet_tcp_queue		*queue;
+	struct nvmet_req		req;
+
+	struct nvme_tcp_cmd_pdu		*cmd_pdu;
+	struct nvme_tcp_rsp_pdu		*rsp_pdu;
+	struct nvme_tcp_data_pdu	*data_pdu;
+	struct nvme_tcp_r2t_pdu		*r2t_pdu;
+
+	u32				rbytes_done;
+	u32				wbytes_done;
+
+	u32				pdu_len;
+	u32				pdu_recv;
+	int				sg_idx;
+	int				nr_mapped;
+	struct msghdr			recv_msg;
+	struct kvec			*iov;
+	u32				flags;
+
+	struct list_head		entry;
+	struct llist_node		lentry;
+	struct nvmet_tcp_send_ctx	snd;
+	__le32				exp_ddgst;
+	__le32				recv_ddgst;
+};
+
+enum nvmet_tcp_queue_state {
+	NVMET_TCP_Q_CONNECTING,
+	NVMET_TCP_Q_LIVE,
+	NVMET_TCP_Q_DISCONNECTING,
+};
+
+struct nvmet_tcp_recv_ctx {
+	union nvme_tcp_pdu		pdu;
+	int				offset;
+	int				left;
+	enum nvmet_tcp_recv_state	state;
+	struct nvmet_tcp_cmd		*cmd;
+};
+
+struct nvmet_tcp_queue {
+	struct socket		*sock;
+	struct nvmet_tcp_port	*port;
+
+	struct nvmet_tcp_cmd	*cmds;
+	unsigned		nr_cmds;
+	struct list_head	free_list;
+	struct llist_head	resp_list;
+	struct list_head	resp_send_list;
+	int			send_list_len;
+
+	spinlock_t		state_lock;
+	enum nvmet_tcp_queue_state state;
+	struct nvmet_cq		nvme_cq;
+	struct nvmet_sq		nvme_sq;
+
+	struct sockaddr_storage	sockaddr;
+	struct sockaddr_storage	sockaddr_peer;
+	struct work_struct	release_work;
+	struct work_struct	io_work;
+
+	int			idx;
+	int			cpu;
+
+	struct list_head	queue_list;
+	struct nvmet_tcp_cmd	*snd_cmd;
+	struct nvmet_tcp_recv_ctx rcv;
+
+	bool			hdr_digest;
+	bool			data_digest;
+	struct ahash_request	*snd_hash;
+	struct ahash_request	*rcv_hash;
+
+	struct nvmet_tcp_cmd	connect;
+
+	struct page_frag_cache	pf_cache;
+
+	void (*old_data_ready)(struct sock *);
+	void (*old_state_change)(struct sock *);
+	void (*old_write_space)(struct sock *);
+};
+
+struct nvmet_tcp_port {
+	struct socket		*sock;
+	struct work_struct	accept_work;
+	struct nvmet_port	*nport;
+	struct sockaddr_storage addr;
+	int			last_cpu;
+	void (*old_data_ready) (struct sock *);
+};
+
+static DEFINE_IDA(nvmet_tcp_queue_ida);
+static LIST_HEAD(nvmet_tcp_queue_list);
+static DEFINE_MUTEX(nvmet_tcp_queue_mutex);
+
+static struct workqueue_struct *nvmet_tcp_wq;
+static struct nvmet_fabrics_ops nvmet_tcp_ops;
+static void nvmet_tcp_free_cmd(struct nvmet_tcp_cmd *c);
+static void nvmet_tcp_finish_cmd(struct nvmet_tcp_cmd *cmd);
+
+static inline u16 nvmet_tcp_cmd_id(struct nvmet_tcp_queue *queue,
+		struct nvmet_tcp_cmd *cmd)
+{
+	return cmd - queue->cmds;
+}
+
+static inline bool nvmet_tcp_has_data_in(struct nvmet_tcp_cmd *cmd)
+{
+	return nvme_is_write(cmd->req.cmd) &&
+		cmd->rbytes_done < cmd->req.transfer_len;
+}
+
+static inline bool nvmet_tcp_need_data_in(struct nvmet_tcp_cmd *cmd)
+{
+	return nvmet_tcp_has_data_in(cmd) && !cmd->req.rsp->status;
+}
+
+static inline bool nvmet_tcp_need_data_out(struct nvmet_tcp_cmd *cmd)
+{
+	return !nvme_is_write(cmd->req.cmd) &&
+		cmd->req.transfer_len > 0 &&
+		!cmd->req.rsp->status;
+}
+
+static inline bool nvmet_tcp_has_inline_data(struct nvmet_tcp_cmd *cmd)
+{
+	return nvme_is_write(cmd->req.cmd) && cmd->pdu_len &&
+		!cmd->rbytes_done;
+}
+
+static inline struct nvmet_tcp_cmd *
+nvmet_tcp_get_cmd(struct nvmet_tcp_queue *queue)
+{
+	struct nvmet_tcp_cmd *cmd;
+
+	cmd = list_first_entry_or_null(&queue->free_list,
+				struct nvmet_tcp_cmd, entry);
+	if (!cmd)
+		return NULL;
+	list_del_init(&cmd->entry);
+
+	cmd->rbytes_done = cmd->wbytes_done = 0;
+	cmd->pdu_len = 0;
+	cmd->pdu_recv = 0;
+	cmd->iov = NULL;
+	cmd->flags = 0;
+	return cmd;
+}
+
+static inline void nvmet_tcp_put_cmd(struct nvmet_tcp_cmd *cmd)
+{
+	if (unlikely(cmd == &cmd->queue->connect))
+		return;
+
+	list_add_tail(&cmd->entry, &cmd->queue->free_list);
+}
+
+static inline u8 nvmet_tcp_hdgst_len(struct nvmet_tcp_queue *queue)
+{
+	return queue->hdr_digest ? NVME_TCP_DIGEST_LENGTH : 0;
+}
+
+static inline u8 nvmet_tcp_ddgst_len(struct nvmet_tcp_queue *queue)
+{
+	return queue->data_digest ? NVME_TCP_DIGEST_LENGTH : 0;
+}
+
+static inline void nvmet_tcp_hdgst(struct ahash_request *hash,
+		void *pdu, size_t len)
+{
+	struct scatterlist sg;
+
+	sg_init_one(&sg, pdu, len);
+	ahash_request_set_crypt(hash, &sg, pdu + len, len);
+	crypto_ahash_digest(hash);
+}
+
+static int nvmet_tcp_verify_hdgst(struct nvmet_tcp_queue *queue,
+	void *pdu, size_t len)
+{
+	struct nvme_tcp_hdr *hdr = pdu;
+	__le32 recv_digest;
+	__le32 exp_digest;
+
+	if (unlikely(!(hdr->flags & NVME_TCP_F_HDGST))) {
+		pr_err("queue %d: header digest enabled but pdu without digest\n",
+			queue->idx);
+		return -EPROTO;
+	}
+
+	recv_digest = *(__le32 *)(pdu + hdr->hlen);
+	nvmet_tcp_hdgst(queue->rcv_hash, pdu, len);
+	exp_digest = *(__le32 *)(pdu + hdr->hlen);
+	if (recv_digest != exp_digest) {
+		pr_err("queue %d: header digest error: recv %#x expected %#x\n",
+			queue->idx, le32_to_cpu(recv_digest),
+			le32_to_cpu(exp_digest));
+		return -EPROTO;
+	}
+
+	return 0;
+}
+
+static int nvmet_tcp_check_ddgst(struct nvmet_tcp_queue *queue, void *pdu)
+{
+	struct nvme_tcp_hdr *hdr = pdu;
+	u32 len;
+
+	len = le32_to_cpu(hdr->plen) - hdr->hlen -
+		(hdr->flags & NVME_TCP_F_HDGST ? nvmet_tcp_hdgst_len(queue) : 0);
+
+	if (unlikely(len && !(hdr->flags & NVME_TCP_F_DDGST))) {
+		pr_err("queue %d: data digest flag is cleared\n", queue->idx);
+		return -EPROTO;
+	}
+
+	return 0;
+}
+
+static void nvmet_tcp_unmap_pdu_iovec(struct nvmet_tcp_cmd *cmd)
+{
+	struct scatterlist *sg;
+	int i;
+
+	sg = &cmd->req.sg[cmd->sg_idx];
+
+	for (i = 0; i < cmd->nr_mapped; i++)
+		kunmap(sg_page(&sg[i]));
+}
+
+static void nvmet_tcp_map_pdu_iovec(struct nvmet_tcp_cmd *cmd)
+{
+	struct kvec *iov = cmd->iov;
+	struct scatterlist *sg;
+	u32 length, offset, sg_offset;
+
+	length = cmd->pdu_len;
+	cmd->nr_mapped = DIV_ROUND_UP(length, PAGE_SIZE);
+	offset = cmd->rbytes_done;
+	cmd->sg_idx = DIV_ROUND_UP(offset, PAGE_SIZE);
+	sg_offset = offset % PAGE_SIZE;
+	sg = &cmd->req.sg[cmd->sg_idx];
+
+	while (length) {
+		u32 iov_len = min_t(u32, length, sg->length - sg_offset);
+
+		iov->iov_base = kmap(sg_page(sg)) + sg->offset + sg_offset;
+		iov->iov_len = iov_len;
+
+		length -= iov_len;
+		sg = sg_next(sg);
+		iov++;
+	}
+
+	iov_iter_kvec(&cmd->recv_msg.msg_iter, READ, cmd->iov,
+		cmd->nr_mapped, cmd->pdu_len);
+}
+
+static void nvmet_tcp_fatal_error(struct nvmet_tcp_queue *queue)
+{
+	queue->rcv.state = NVMET_TCP_RECV_ERR;
+	if (queue->nvme_sq.ctrl)
+		nvmet_ctrl_fatal_error(queue->nvme_sq.ctrl);
+	else
+		kernel_sock_shutdown(queue->sock, SHUT_RDWR);
+}
+
+static int nvmet_tcp_map_data(struct nvmet_tcp_cmd *cmd)
+{
+	struct nvme_sgl_desc *sgl = &cmd->req.cmd->common.dptr.sgl;
+	u32 len = le32_to_cpu(sgl->length);
+
+	if (!cmd->req.data_len)
+		return 0;
+
+	if (sgl->type == ((NVME_SGL_FMT_DATA_DESC << 4) |
+			  NVME_SGL_FMT_OFFSET)) {
+		if (!nvme_is_write(cmd->req.cmd))
+			return NVME_SC_INVALID_FIELD | NVME_SC_DNR;
+
+		if (len > cmd->req.port->inline_data_size)
+			return NVME_SC_SGL_INVALID_OFFSET | NVME_SC_DNR;
+		cmd->pdu_len = len;
+	}
+	cmd->req.transfer_len += len;
+
+	cmd->req.sg = sgl_alloc(len, GFP_KERNEL, &cmd->req.sg_cnt);
+	if (!cmd->req.sg)
+		return NVME_SC_INTERNAL;
+	cmd->snd.cur_sg = cmd->req.sg;
+
+	if (nvmet_tcp_has_data_in(cmd)) {
+		cmd->iov = kmalloc_array(cmd->req.sg_cnt,
+				sizeof(*cmd->iov), GFP_KERNEL);
+		if (!cmd->iov)
+			goto err;
+	}
+
+	return 0;
+err:
+	sgl_free(cmd->req.sg);
+	return NVME_SC_INTERNAL;
+}
+
+static void nvmet_tcp_ddgst(struct ahash_request *hash,
+		struct nvmet_tcp_cmd *cmd)
+{
+	ahash_request_set_crypt(hash, cmd->req.sg,
+		(void *)&cmd->exp_ddgst, cmd->req.transfer_len);
+	crypto_ahash_digest(hash);
+}
+
+static void nvmet_setup_c2h_data_pdu(struct nvmet_tcp_cmd *cmd)
+{
+	struct nvme_tcp_data_pdu *pdu = cmd->data_pdu;
+	struct nvmet_tcp_queue *queue = cmd->queue;
+	u8 hdgst = nvmet_tcp_hdgst_len(cmd->queue);
+	u8 ddgst = nvmet_tcp_ddgst_len(cmd->queue);
+
+	cmd->snd.offset = 0;
+	cmd->snd.state = NVMET_TCP_SEND_DATA_PDU;
+
+	pdu->hdr.type = nvme_tcp_c2h_data;
+	pdu->hdr.flags = NVME_TCP_F_DATA_LAST;
+	pdu->hdr.hlen = sizeof(*pdu);
+	pdu->hdr.pdo = pdu->hdr.hlen + hdgst;
+	pdu->hdr.plen =
+		cpu_to_le32(pdu->hdr.hlen + hdgst + cmd->req.transfer_len + ddgst);
+	pdu->command_id = cmd->req.rsp->command_id;
+	pdu->data_length = cpu_to_le32(cmd->req.transfer_len);
+	pdu->data_offset = cpu_to_le32(cmd->wbytes_done);
+
+	if (queue->data_digest) {
+		pdu->hdr.flags |= NVME_TCP_F_DDGST;
+		nvmet_tcp_ddgst(queue->snd_hash, cmd);
+	}
+
+	if (cmd->queue->hdr_digest) {
+		pdu->hdr.flags |= NVME_TCP_F_HDGST;
+		nvmet_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
+	}
+}
+
+static void nvmet_setup_r2t_pdu(struct nvmet_tcp_cmd *cmd)
+{
+	struct nvme_tcp_r2t_pdu *pdu = cmd->r2t_pdu;
+	struct nvmet_tcp_queue *queue = cmd->queue;
+	u8 hdgst = nvmet_tcp_hdgst_len(cmd->queue);
+
+	cmd->snd.offset = 0;
+	cmd->snd.state = NVMET_TCP_SEND_R2T;
+
+	pdu->hdr.type = nvme_tcp_r2t;
+	pdu->hdr.flags = 0;
+	pdu->hdr.hlen = sizeof(*pdu);
+	pdu->hdr.pdo = 0;
+	pdu->hdr.plen = cpu_to_le32(pdu->hdr.hlen + hdgst);
+
+	pdu->command_id = cmd->req.cmd->common.command_id;
+	pdu->ttag = nvmet_tcp_cmd_id(cmd->queue, cmd);
+	pdu->r2t_length = cpu_to_le32(cmd->req.transfer_len - cmd->rbytes_done);
+	pdu->r2t_offset = cpu_to_le32(cmd->rbytes_done);
+	if (cmd->queue->hdr_digest) {
+		pdu->hdr.flags |= NVME_TCP_F_HDGST;
+		nvmet_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
+	}
+}
+
+static void nvmet_setup_response_pdu(struct nvmet_tcp_cmd *cmd)
+{
+	struct nvme_tcp_rsp_pdu *pdu = cmd->rsp_pdu;
+	struct nvmet_tcp_queue *queue = cmd->queue;
+	u8 hdgst = nvmet_tcp_hdgst_len(cmd->queue);
+
+	cmd->snd.offset = 0;
+	cmd->snd.state = NVMET_TCP_SEND_RESPONSE;
+
+	pdu->hdr.type = nvme_tcp_rsp;
+	pdu->hdr.flags = 0;
+	pdu->hdr.hlen = sizeof(*pdu);
+	pdu->hdr.pdo = 0;
+	pdu->hdr.plen = cpu_to_le32(pdu->hdr.hlen + hdgst);
+	if (cmd->queue->hdr_digest) {
+		pdu->hdr.flags |= NVME_TCP_F_HDGST;
+		nvmet_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
+	}
+}
+
+static struct nvmet_tcp_cmd *nvmet_tcp_reverse_list(struct nvmet_tcp_queue *queue, struct llist_node *node)
+{
+	struct nvmet_tcp_cmd *cmd;
+
+	while (node) {
+		struct nvmet_tcp_cmd *cmd = container_of(node, struct nvmet_tcp_cmd, lentry);
+
+		list_add(&cmd->entry, &queue->resp_send_list);
+		node = node->next;
+		queue->send_list_len++;
+	}
+
+	cmd = list_first_entry(&queue->resp_send_list, struct nvmet_tcp_cmd, entry);
+	return cmd;
+}
+
+static struct nvmet_tcp_cmd *nvmet_tcp_fetch_send_command(struct nvmet_tcp_queue *queue)
+{
+	struct llist_node *node;
+
+	queue->snd_cmd = list_first_entry_or_null(&queue->resp_send_list,
+				struct nvmet_tcp_cmd, entry);
+	if (!queue->snd_cmd) {
+		node = llist_del_all(&queue->resp_list);
+		if (!node)
+			return NULL;
+		queue->snd_cmd = nvmet_tcp_reverse_list(queue, node);
+	}
+
+	list_del_init(&queue->snd_cmd->entry);
+	queue->send_list_len--;
+
+	if (nvmet_tcp_need_data_out(queue->snd_cmd))
+		nvmet_setup_c2h_data_pdu(queue->snd_cmd);
+	else if (nvmet_tcp_need_data_in(queue->snd_cmd))
+		nvmet_setup_r2t_pdu(queue->snd_cmd);
+	else
+		nvmet_setup_response_pdu(queue->snd_cmd);
+
+	return queue->snd_cmd;
+}
+
+static void nvmet_tcp_queue_response(struct nvmet_req *req)
+{
+	struct nvmet_tcp_cmd *cmd =
+		container_of(req, struct nvmet_tcp_cmd, req);
+	struct nvmet_tcp_queue	*queue = cmd->queue;
+
+	llist_add(&cmd->lentry, &queue->resp_list);
+	queue_work_on(cmd->queue->cpu, nvmet_tcp_wq, &cmd->queue->io_work);
+}
+
+static int nvmet_try_send_data_pdu(struct nvmet_tcp_cmd *cmd)
+{
+	struct nvmet_tcp_send_ctx *snd = &cmd->snd;
+	u8 hdgst = nvmet_tcp_hdgst_len(cmd->queue);
+	int left = sizeof(*cmd->data_pdu) - snd->offset + hdgst;
+	int ret;
+
+	ret = kernel_sendpage(cmd->queue->sock, virt_to_page(cmd->data_pdu),
+			offset_in_page(cmd->data_pdu) + snd->offset,
+			left, MSG_DONTWAIT | MSG_MORE);
+	if (ret <= 0)
+		return ret;
+
+	snd->offset += ret;
+	left -= ret;
+
+	if (left)
+		return -EAGAIN;
+
+	snd->state = NVMET_TCP_SEND_DATA;
+	snd->offset = 0;
+	return 1;
+}
+
+static int nvmet_try_send_data(struct nvmet_tcp_cmd *cmd)
+{
+	struct nvmet_tcp_send_ctx *snd = &cmd->snd;
+	struct nvmet_tcp_queue *queue = cmd->queue;
+	int ret;
+
+	while (snd->cur_sg) {
+		struct page *page = sg_page(snd->cur_sg);
+		u32 left = snd->cur_sg->length - snd->offset;
+
+		ret = kernel_sendpage(cmd->queue->sock, page, snd->offset,
+					left, MSG_DONTWAIT | MSG_MORE);
+		if (ret <= 0)
+			return ret;
+
+		snd->offset += ret;
+		cmd->wbytes_done += ret;
+
+		/* Done with sg? */
+		if (snd->offset == snd->cur_sg->length) {
+			snd->cur_sg = sg_next(snd->cur_sg);
+			snd->offset = 0;
+		}
+	}
+
+	if (queue->data_digest) {
+		cmd->snd.state = NVMET_TCP_SEND_DDGST;
+		snd->offset = 0;
+	} else {
+		nvmet_setup_response_pdu(cmd);
+	}
+	return 1;
+
+}
+
+static int nvmet_try_send_response(struct nvmet_tcp_cmd *cmd, bool last_in_batch)
+{
+	struct nvmet_tcp_send_ctx *snd = &cmd->snd;
+	u8 hdgst = nvmet_tcp_hdgst_len(cmd->queue);
+	int left = sizeof(*cmd->rsp_pdu) - snd->offset + hdgst;
+	int flags = MSG_DONTWAIT;
+	int ret;
+
+	if (!last_in_batch && cmd->queue->send_list_len)
+		flags |= MSG_MORE;
+	else
+		flags |= MSG_EOR;
+
+	ret = kernel_sendpage(cmd->queue->sock, virt_to_page(cmd->rsp_pdu),
+			offset_in_page(cmd->rsp_pdu) + snd->offset, left, flags);
+	if (ret <= 0)
+		return ret;
+	snd->offset += ret;
+	left -= ret;
+
+	if (left)
+		return -EAGAIN;
+
+	kfree(cmd->iov);
+	sgl_free(cmd->req.sg);
+	cmd->queue->snd_cmd = NULL;
+	nvmet_tcp_put_cmd(cmd);
+	return 1;
+}
+
+static int nvmet_try_send_r2t(struct nvmet_tcp_cmd *cmd, bool last_in_batch)
+{
+	struct nvmet_tcp_send_ctx *snd = &cmd->snd;
+	u8 hdgst = nvmet_tcp_hdgst_len(cmd->queue);
+	int left = sizeof(*cmd->r2t_pdu) - snd->offset + hdgst;
+	int flags = MSG_DONTWAIT;
+	int ret;
+
+	if (!last_in_batch && cmd->queue->send_list_len)
+		flags |= MSG_MORE;
+	else
+		flags |= MSG_EOR;
+
+	ret = kernel_sendpage(cmd->queue->sock, virt_to_page(cmd->r2t_pdu),
+			offset_in_page(cmd->r2t_pdu) + snd->offset, left, flags);
+	if (ret <= 0)
+		return ret;
+	snd->offset += ret;
+	left -= ret;
+
+	if (left)
+		return -EAGAIN;
+
+	cmd->queue->snd_cmd = NULL;
+	return 1;
+}
+
+static int nvmet_try_send_ddgst(struct nvmet_tcp_cmd *cmd)
+{
+	struct nvmet_tcp_queue *queue = cmd->queue;
+	struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
+	struct kvec iov = {
+		.iov_base = &cmd->exp_ddgst + cmd->snd.offset,
+		.iov_len = NVME_TCP_DIGEST_LENGTH - cmd->snd.offset
+	};
+	int ret;
+
+	ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
+	if (unlikely(ret <= 0))
+		return ret;
+
+	cmd->snd.offset += ret;
+	nvmet_setup_response_pdu(cmd);
+	return 1;
+}
+
+static int nvmet_tcp_try_send_one(struct nvmet_tcp_queue *queue,
+		bool last_in_batch)
+{
+	struct nvmet_tcp_cmd *cmd = queue->snd_cmd;
+	int ret = 0;
+
+	if (!cmd || queue->state == NVMET_TCP_Q_DISCONNECTING) {
+		cmd = nvmet_tcp_fetch_send_command(queue);
+		if (unlikely(!cmd))
+			return 0;
+	}
+
+	if (cmd->snd.state == NVMET_TCP_SEND_DATA_PDU) {
+		ret = nvmet_try_send_data_pdu(cmd);
+		if (ret <= 0)
+			goto done_send;
+	}
+
+	if (cmd->snd.state == NVMET_TCP_SEND_DATA) {
+		ret = nvmet_try_send_data(cmd);
+		if (ret <= 0)
+			goto done_send;
+	}
+
+	if (cmd->snd.state == NVMET_TCP_SEND_DDGST) {
+		ret = nvmet_try_send_ddgst(cmd);
+		if (ret <= 0)
+			goto done_send;
+	}
+
+	if (cmd->snd.state == NVMET_TCP_SEND_R2T) {
+		ret = nvmet_try_send_r2t(cmd, last_in_batch);
+		if (ret <= 0)
+			goto done_send;
+	}
+
+	if (cmd->snd.state == NVMET_TCP_SEND_RESPONSE)
+		ret = nvmet_try_send_response(cmd, last_in_batch);
+
+done_send:
+	if (ret < 0) {
+		if (ret == -EAGAIN)
+			return 0;
+		return ret;
+	}
+
+	return 1;
+}
+
+static int nvmet_tcp_try_send(struct nvmet_tcp_queue *queue,
+		int budget, int *sends)
+{
+	int i, ret = 0;
+
+	for (i = 0; i < budget; i++) {
+		ret = nvmet_tcp_try_send_one(queue, i == budget - 1);
+		if (ret <= 0)
+			break;
+		(*sends)++;
+	}
+
+	return ret;
+}
+
+static void nvmet_prepare_receive_pdu(struct nvmet_tcp_queue *queue)
+{
+	struct nvmet_tcp_recv_ctx *rcv = &queue->rcv;
+
+	rcv->offset = 0;
+	rcv->left = sizeof(struct nvme_tcp_hdr);
+	rcv->cmd = NULL;
+	rcv->state = NVMET_TCP_RECV_PDU;
+}
+
+static void nvmet_tcp_free_crypto(struct nvmet_tcp_queue *queue)
+{
+	struct crypto_ahash *tfm = crypto_ahash_reqtfm(queue->rcv_hash);
+
+	ahash_request_free(queue->rcv_hash);
+	ahash_request_free(queue->snd_hash);
+	crypto_free_ahash(tfm);
+}
+
+static int nvmet_tcp_alloc_crypto(struct nvmet_tcp_queue *queue)
+{
+	struct crypto_ahash *tfm;
+
+	tfm = crypto_alloc_ahash("crc32c", 0, CRYPTO_ALG_ASYNC);
+	if (IS_ERR(tfm))
+		return PTR_ERR(tfm);
+
+	queue->snd_hash = ahash_request_alloc(tfm, GFP_KERNEL);
+	if (!queue->snd_hash)
+		goto free_tfm;
+	ahash_request_set_callback(queue->snd_hash, 0, NULL, NULL);
+
+	queue->rcv_hash = ahash_request_alloc(tfm, GFP_KERNEL);
+	if (!queue->rcv_hash)
+		goto free_snd_hash;
+	ahash_request_set_callback(queue->rcv_hash, 0, NULL, NULL);
+
+	return 0;
+free_snd_hash:
+	ahash_request_free(queue->snd_hash);
+free_tfm:
+	crypto_free_ahash(tfm);
+	return -ENOMEM;
+}
+
+
+static int nvmet_tcp_handle_icreq(struct nvmet_tcp_queue *queue)
+{
+	struct nvme_tcp_icreq_pdu *icreq = &queue->rcv.pdu.icreq;
+	struct nvme_tcp_icresp_pdu *icresp = &queue->rcv.pdu.icresp;
+	struct msghdr msg = {};
+	struct kvec iov;
+	int ret;
+
+	if (le32_to_cpu(icreq->hdr.plen) != sizeof(struct nvme_tcp_icreq_pdu)) {
+		pr_err("bad nvme-tcp pdu length (%d)\n",
+			le32_to_cpu(icreq->hdr.plen));
+		nvmet_tcp_fatal_error(queue);
+		return -EPROTO;
+	}
+
+	if (icreq->pfv != NVME_TCP_PFV_1_0) {
+		pr_err("queue %d: bad pfv %d\n", queue->idx, icreq->pfv);
+		return -EINVAL;
+	}
+
+	queue->hdr_digest = !!(icreq->digest & NVME_TCP_HDR_DIGEST_ENABLE);
+	queue->data_digest = !!(icreq->digest & NVME_TCP_DATA_DIGEST_ENABLE);
+	if (queue->hdr_digest || queue->data_digest) {
+		ret = nvmet_tcp_alloc_crypto(queue);
+		if (ret)
+			return ret;
+	}
+
+	if (icreq->hpda != 0) {
+		pr_err("queue %d: unsupported hpda %d\n", queue->idx,
+			icreq->hpda);
+		ret = -EPROTO;
+		goto free_crypto;
+	}
+
+	memset(icresp, 0, sizeof(*icresp));
+	icresp->hdr.type = nvme_tcp_icresp;
+	icresp->hdr.hlen = sizeof(*icresp);
+	icresp->hdr.pdo = 0;
+	icresp->hdr.plen = cpu_to_le32(icresp->hdr.hlen);
+	icresp->pfv = cpu_to_le16(NVME_TCP_PFV_1_0);
+	icresp->maxdata = 0xffff; /* FIXME: support r2t */
+	icresp->cpda = 0;
+	if (queue->hdr_digest)
+		icresp->digest |= NVME_TCP_HDR_DIGEST_ENABLE;
+	if (queue->data_digest)
+		icresp->digest |= NVME_TCP_DATA_DIGEST_ENABLE;
+
+	iov.iov_base = icresp;
+	iov.iov_len = sizeof(*icresp);
+	ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
+	if (ret < 0)
+		goto free_crypto;
+
+	queue->state = NVMET_TCP_Q_LIVE;
+	nvmet_prepare_receive_pdu(queue);
+	return 0;
+free_crypto:
+	if (queue->hdr_digest || queue->data_digest)
+		nvmet_tcp_free_crypto(queue);
+	return ret;
+}
+
+static void nvmet_tcp_handle_req_failure(struct nvmet_tcp_queue *queue,
+		struct nvmet_tcp_cmd *cmd, struct nvmet_req *req)
+{
+	int ret;
+
+	/* recover the expected data transfer length */
+	req->data_len = le32_to_cpu(req->cmd->common.dptr.sgl.length);
+
+	if (!nvme_is_write(cmd->req.cmd) ||
+	    req->data_len > cmd->req.port->inline_data_size) {
+		nvmet_prepare_receive_pdu(queue);
+		return;
+	}
+
+	ret = nvmet_tcp_map_data(cmd);
+	if (unlikely(ret)) {
+		pr_err("queue %d: failed to map data\n", queue->idx);
+		nvmet_tcp_fatal_error(queue);
+		return;
+	}
+
+	queue->rcv.state = NVMET_TCP_RECV_DATA;
+	nvmet_tcp_map_pdu_iovec(cmd);
+	cmd->flags |= NVMET_TCP_F_INIT_FAILED;
+}
+
+static int nvmet_tcp_handle_h2c_data_pdu(struct nvmet_tcp_queue *queue)
+{
+	struct nvmet_tcp_recv_ctx *rcv = &queue->rcv;
+	struct nvme_tcp_data_pdu *data = &rcv->pdu.data;
+	struct nvmet_tcp_cmd *cmd;
+
+	cmd = &queue->cmds[data->ttag];
+
+	if (le32_to_cpu(data->data_offset) != cmd->rbytes_done) {
+		pr_err("queue %d ttag %u unexpected data offset %u (expected %u)\n",
+			queue->idx, data->ttag, le32_to_cpu(data->data_offset),
+			cmd->rbytes_done);
+		/* FIXME: use path and transport errors */
+		nvmet_req_complete(&cmd->req,
+			NVME_SC_INVALID_FIELD | NVME_SC_DNR);
+		return -EPROTO;
+	}
+
+	cmd->pdu_len = le32_to_cpu(data->data_length);
+	cmd->pdu_recv = 0;
+	nvmet_tcp_map_pdu_iovec(cmd);
+	rcv->cmd = cmd;
+	rcv->state = NVMET_TCP_RECV_DATA;
+
+	return 0;
+}
+
+static int nvmet_tcp_done_recv_pdu(struct nvmet_tcp_queue *queue)
+{
+	struct nvmet_tcp_recv_ctx *rcv = &queue->rcv;
+	struct nvme_tcp_hdr *hdr = &rcv->pdu.cmd.hdr;
+	struct nvme_command *nvme_cmd = &rcv->pdu.cmd.cmd;
+	struct nvmet_req *req;
+	int ret;
+
+	if (unlikely(queue->state == NVMET_TCP_Q_CONNECTING)) {
+		if (hdr->type != nvme_tcp_icreq) {
+			pr_err("unexpected pdu type (%d) before icreq\n",
+				hdr->type);
+			nvmet_tcp_fatal_error(queue);
+			return -EPROTO;
+		}
+		return nvmet_tcp_handle_icreq(queue);
+	}
+
+	if (hdr->type == nvme_tcp_h2c_data) {
+		ret = nvmet_tcp_handle_h2c_data_pdu(queue);
+		if (unlikely(ret))
+			return ret;
+		return 0;
+	}
+
+	rcv->cmd = nvmet_tcp_get_cmd(queue);
+	if (unlikely(!rcv->cmd)) {
+		/* This should never happen */
+		pr_err("queue %d: failed get command nr_cmds: %d, send_list_len: %d, opcode: %d",
+			queue->idx, queue->nr_cmds, queue->send_list_len, nvme_cmd->common.opcode);
+		nvmet_tcp_fatal_error(queue);
+		return -ENOMEM;
+	}
+
+	req = &rcv->cmd->req;
+	memcpy(req->cmd, nvme_cmd, sizeof(*nvme_cmd));
+
+	if (unlikely(!nvmet_req_init(req, &queue->nvme_cq,
+			&queue->nvme_sq, &nvmet_tcp_ops))) {
+		pr_err("failed cmd %p id %d opcode %d, data_len: %d\n",
+			req->cmd, req->cmd->common.command_id,
+			req->cmd->common.opcode,
+			le32_to_cpu(req->cmd->common.dptr.sgl.length));
+
+		nvmet_tcp_handle_req_failure(queue, rcv->cmd, req);
+		return -EAGAIN;
+	}
+
+	ret = nvmet_tcp_map_data(rcv->cmd);
+	if (unlikely(ret)) {
+		pr_err("queue %d: failed to map data\n", queue->idx);
+		if (nvmet_tcp_has_inline_data(rcv->cmd))
+			nvmet_tcp_fatal_error(queue);
+		else
+			nvmet_req_complete(req, ret);
+		ret = -EAGAIN;
+		goto out;
+	}
+
+	if (nvmet_tcp_need_data_in(rcv->cmd)) {
+		if (nvmet_tcp_has_inline_data(rcv->cmd)) {
+			rcv->state = NVMET_TCP_RECV_DATA;
+			nvmet_tcp_map_pdu_iovec(rcv->cmd);
+			return 0;
+		} else {
+			/* send back R2T */
+			nvmet_tcp_queue_response(&rcv->cmd->req);
+			goto out;
+		}
+	}
+
+	nvmet_req_execute(&rcv->cmd->req);
+out:
+	nvmet_prepare_receive_pdu(queue);
+	return ret;
+}
+
+static const u8 nvme_tcp_pdu_sizes[] = {
+	[nvme_tcp_icreq]	= sizeof(struct nvme_tcp_icreq_pdu),
+	[nvme_tcp_cmd]		= sizeof(struct nvme_tcp_cmd_pdu),
+	[nvme_tcp_h2c_data]	= sizeof(struct nvme_tcp_data_pdu),
+};
+
+static inline u8 nvmet_tcp_pdu_size(u8 type)
+{
+	size_t idx = type;
+
+	return (idx < ARRAY_SIZE(nvme_tcp_pdu_sizes) && nvme_tcp_pdu_sizes[idx]) ?
+			nvme_tcp_pdu_sizes[idx] : 0;
+}
+
+static inline bool nvmet_tcp_pdu_valid(u8 type)
+{
+	switch (type) {
+	case nvme_tcp_icreq:
+	case nvme_tcp_cmd:
+	case nvme_tcp_h2c_data:
+		/* fallthru */
+		return true;
+	}
+
+	return false;
+}
+
+static int nvmet_tcp_try_recv_pdu(struct nvmet_tcp_queue *queue)
+{
+	struct nvmet_tcp_recv_ctx *rcv = &queue->rcv;
+	struct nvme_tcp_hdr *hdr = &rcv->pdu.cmd.hdr;
+	int len;
+	struct kvec iov;
+	struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
+
+recv:
+	iov.iov_base = (void *)&rcv->pdu + rcv->offset;
+	iov.iov_len = rcv->left;
+	len = kernel_recvmsg(queue->sock, &msg, &iov, 1,
+			iov.iov_len, msg.msg_flags);
+	if (unlikely(len < 0))
+		return len;
+
+	rcv->offset += len;
+	rcv->left -= len;
+	if (rcv->left) {
+		return -EAGAIN;
+	} else if (rcv->offset == sizeof(struct nvme_tcp_hdr)) {
+		u8 hdgst = nvmet_tcp_hdgst_len(queue);
+
+		if (unlikely(!nvmet_tcp_pdu_valid(hdr->type))) {
+			pr_err("unexpected pdu type %d\n", hdr->type);
+			nvmet_tcp_fatal_error(queue);
+			return -EIO;
+		}
+
+		if (unlikely(hdr->hlen != nvmet_tcp_pdu_size(hdr->type))) {
+			pr_err("pdu type %d bad hlen %d\n", hdr->type, hdr->hlen);
+			return -EIO;
+		}
+
+		rcv->left = hdr->hlen - rcv->offset + hdgst;
+		goto recv;
+	}
+
+	if (queue->hdr_digest &&
+	    nvmet_tcp_verify_hdgst(queue, &rcv->pdu, rcv->offset)) {
+		nvmet_tcp_fatal_error(queue); /* fatal */
+		return -EPROTO;
+	}
+
+	if (queue->data_digest &&
+	    nvmet_tcp_check_ddgst(queue, &rcv->pdu)) {
+		nvmet_tcp_fatal_error(queue); /* fatal */
+		return -EPROTO;
+	}
+
+	return nvmet_tcp_done_recv_pdu(queue);
+}
+
+static void nvmet_tcp_prep_recv_ddgst(struct nvmet_tcp_cmd *cmd)
+{
+	struct nvmet_tcp_queue *queue = cmd->queue;
+
+	nvmet_tcp_ddgst(queue->rcv_hash, cmd);
+	queue->rcv.offset = 0;
+	queue->rcv.left = NVME_TCP_DIGEST_LENGTH;
+	queue->rcv.state = NVMET_TCP_RECV_DDGST;
+}
+
+static int nvmet_tcp_try_recv_data(struct nvmet_tcp_queue *queue)
+{
+	struct nvmet_tcp_cmd  *cmd = queue->rcv.cmd;
+	int ret;
+
+	while (msg_data_left(&cmd->recv_msg)) {
+		ret = sock_recvmsg(cmd->queue->sock, &cmd->recv_msg,
+			cmd->recv_msg.msg_flags);
+		if (ret <= 0)
+			return ret;
+
+		cmd->pdu_recv += ret;
+		cmd->rbytes_done += ret;
+	}
+
+	nvmet_tcp_unmap_pdu_iovec(cmd);
+
+	if (!(cmd->flags & NVMET_TCP_F_INIT_FAILED) &&
+	    cmd->rbytes_done == cmd->req.transfer_len) {
+		if (queue->data_digest) {
+			nvmet_tcp_prep_recv_ddgst(cmd);
+			return 0;
+		} else {
+			nvmet_req_execute(&cmd->req);
+		}
+	}
+
+	nvmet_prepare_receive_pdu(queue);
+	return 0;
+}
+
+static int nvmet_tcp_try_recv_ddgst(struct nvmet_tcp_queue *queue)
+{
+	struct nvmet_tcp_recv_ctx *rcv = &queue->rcv;
+	struct nvmet_tcp_cmd *cmd = rcv->cmd;
+	int ret;
+	struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
+	struct kvec iov = {
+		.iov_base = (void *)&cmd->recv_ddgst + rcv->offset,
+		.iov_len = rcv->left
+	};
+
+	ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
+			iov.iov_len, msg.msg_flags);
+	if (unlikely(ret < 0))
+		return ret;
+
+	rcv->offset += ret;
+	rcv->left -= ret;
+	if (rcv->left)
+		return -EAGAIN;
+
+	if (queue->data_digest && cmd->exp_ddgst != cmd->recv_ddgst) {
+		pr_err("queue %d: cmd %d pdu (%d) data digest error: recv %#x expected %#x\n",
+		queue->idx, cmd->req.cmd->common.command_id, rcv->pdu.cmd.hdr.type,
+		le32_to_cpu(cmd->recv_ddgst), le32_to_cpu(cmd->exp_ddgst));
+		nvmet_tcp_finish_cmd(cmd);
+		nvmet_tcp_fatal_error(queue);
+		ret = -EPROTO;
+		goto out;
+	}
+
+	if (!(cmd->flags & NVMET_TCP_F_INIT_FAILED) &&
+	    cmd->rbytes_done == cmd->req.transfer_len)
+		nvmet_req_execute(&cmd->req);
+	ret = 0;
+out:
+	nvmet_prepare_receive_pdu(queue);
+	return ret;
+}
+
+static int nvmet_tcp_try_recv_one(struct nvmet_tcp_queue *queue)
+{
+	struct nvmet_tcp_recv_ctx *rcv = &queue->rcv;
+	int result;
+
+	if (unlikely(rcv->state == NVMET_TCP_RECV_ERR))
+		return 0;
+
+	if (rcv->state == NVMET_TCP_RECV_PDU) {
+		result = nvmet_tcp_try_recv_pdu(queue);
+		if (result != 0)
+			goto done_recv;
+	}
+
+	if (rcv->state == NVMET_TCP_RECV_DATA) {
+		result = nvmet_tcp_try_recv_data(queue);
+		if (result != 0)
+			goto done_recv;
+	}
+
+	if (rcv->state == NVMET_TCP_RECV_DDGST) {
+		result = nvmet_tcp_try_recv_ddgst(queue);
+		if (result != 0)
+			goto done_recv;
+	}
+
+done_recv:
+	if (result < 0) {
+		if (result == -EAGAIN)
+			return 0;
+		return result;
+	}
+	return 1;
+}
+
+static int nvmet_tcp_try_recv(struct nvmet_tcp_queue *queue,
+		int budget, int *recvs)
+{
+	int i, ret = 0;
+
+	for (i = 0; i < budget; i++) {
+		ret = nvmet_tcp_try_recv_one(queue);
+		if (ret <= 0)
+			break;
+		(*recvs)++;
+	}
+
+	return ret;
+}
+
+static void nvmet_tcp_schedule_release_queue(struct nvmet_tcp_queue *queue)
+{
+	spin_lock(&queue->state_lock);
+	if (queue->state == NVMET_TCP_Q_DISCONNECTING)
+		goto out;
+
+	queue->state = NVMET_TCP_Q_DISCONNECTING;
+	schedule_work(&queue->release_work);
+out:
+	spin_unlock(&queue->state_lock);
+}
+
+static void nvmet_tcp_io_work(struct work_struct *w)
+{
+	struct nvmet_tcp_queue *queue =
+		container_of(w, struct nvmet_tcp_queue, io_work);
+	bool pending;
+	int ret, ops = 0;
+
+	do {
+		pending = false;
+
+		ret = nvmet_tcp_try_recv(queue, NVMET_TCP_RECV_BUDGET, &ops);
+		if (ret > 0) {
+			pending = true;
+		} else if (ret < 0) {
+			if (ret == -EPIPE || ret == -ECONNRESET)
+				kernel_sock_shutdown(queue->sock, SHUT_RDWR);
+			else
+				nvmet_tcp_fatal_error(queue);
+			return;
+		}
+
+		ret = nvmet_tcp_try_send(queue, NVMET_TCP_SEND_BUDGET, &ops);
+		if (ret > 0) {
+			/* transmitted message/data */
+			pending = true;
+		} else if (ret < 0) {
+			if (ret == -EPIPE || ret == -ECONNRESET)
+				kernel_sock_shutdown(queue->sock, SHUT_RDWR);
+			else
+				nvmet_tcp_fatal_error(queue);
+			return;
+		}
+
+	} while (pending && ops < NVMET_TCP_IO_WORK_BUDGET);
+
+	/*
+	 * We exhausted our budget, requeue ourselves
+	 */
+	if (pending)
+		queue_work_on(queue->cpu, nvmet_tcp_wq, &queue->io_work);
+}
+
+static int nvmet_tcp_alloc_cmd(struct nvmet_tcp_queue *queue,
+		struct nvmet_tcp_cmd *c)
+{
+	u8 hdgst = nvmet_tcp_hdgst_len(queue);
+
+	c->queue = queue;
+	c->req.port = queue->port->nport;
+
+	c->cmd_pdu = page_frag_alloc(&queue->pf_cache,
+			sizeof(*c->cmd_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+	if (!c->cmd_pdu)
+		return -ENOMEM;
+	c->req.cmd = &c->cmd_pdu->cmd;
+
+	c->rsp_pdu = page_frag_alloc(&queue->pf_cache,
+			sizeof(*c->rsp_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+	if (!c->rsp_pdu)
+		goto out_free_cmd;
+	c->req.rsp = &c->rsp_pdu->cqe;
+
+	c->data_pdu = page_frag_alloc(&queue->pf_cache,
+			sizeof(*c->data_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+	if (!c->data_pdu)
+		goto out_free_rsp;
+
+	c->r2t_pdu = page_frag_alloc(&queue->pf_cache,
+			sizeof(*c->r2t_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+	if (!c->r2t_pdu)
+		goto out_free_data;
+
+	c->recv_msg.msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL;
+
+	list_add_tail(&c->entry, &queue->free_list);
+
+	return 0;
+out_free_data:
+	page_frag_free(c->data_pdu);
+out_free_rsp:
+	page_frag_free(c->rsp_pdu);
+out_free_cmd:
+	page_frag_free(c->cmd_pdu);
+	return -ENOMEM;
+}
+
+static void nvmet_tcp_free_cmd(struct nvmet_tcp_cmd *c)
+{
+	page_frag_free(c->r2t_pdu);
+	page_frag_free(c->data_pdu);
+	page_frag_free(c->rsp_pdu);
+	page_frag_free(c->cmd_pdu);
+}
+
+static int nvmet_tcp_alloc_cmds(struct nvmet_tcp_queue *queue)
+{
+	struct nvmet_tcp_cmd *cmds;
+	int i, ret = -EINVAL, nr_cmds = queue->nr_cmds;
+
+	cmds = kcalloc(nr_cmds, sizeof(struct nvmet_tcp_cmd), GFP_KERNEL);
+	if (!cmds)
+		goto out;
+
+	for (i = 0; i < nr_cmds; i++) {
+		ret = nvmet_tcp_alloc_cmd(queue, cmds + i);
+		if (ret)
+			goto out_free;
+	}
+
+	queue->cmds = cmds;
+
+	return 0;
+out_free:
+	while (--i >= 0)
+		nvmet_tcp_free_cmd(cmds + i);
+	kfree(cmds);
+out:
+	return ret;
+}
+
+static void nvmet_tcp_free_cmds(struct nvmet_tcp_queue *queue)
+{
+	struct nvmet_tcp_cmd *cmds = queue->cmds;
+	int i;
+
+	for (i = 0; i < queue->nr_cmds; i++)
+		nvmet_tcp_free_cmd(cmds + i);
+
+	nvmet_tcp_free_cmd(&queue->connect);
+	kfree(cmds);
+}
+
+static void nvmet_tcp_restore_socket_callbacks(struct nvmet_tcp_queue *queue)
+{
+	struct socket *sock = queue->sock;
+
+	write_lock_bh(&sock->sk->sk_callback_lock);
+	sock->sk->sk_data_ready =  queue->old_data_ready;
+	sock->sk->sk_state_change = queue->old_state_change;
+	sock->sk->sk_write_space = queue->old_write_space;
+	sock->sk->sk_user_data = NULL;
+	write_unlock_bh(&sock->sk->sk_callback_lock);
+}
+
+static void nvmet_tcp_finish_cmd(struct nvmet_tcp_cmd *cmd)
+{
+	nvmet_req_uninit(&cmd->req);
+	nvmet_tcp_unmap_pdu_iovec(cmd);
+	sgl_free(cmd->req.sg);
+}
+
+static void nvmet_tcp_uninit_data_in_cmds(struct nvmet_tcp_queue *queue)
+{
+	struct nvmet_tcp_cmd *cmd = queue->cmds;
+	int i;
+
+	for (i = 0; i < queue->nr_cmds; i++, cmd++) {
+		if (nvmet_tcp_need_data_in(cmd))
+			nvmet_tcp_finish_cmd(cmd);
+	}
+
+	if (!queue->nr_cmds && nvmet_tcp_need_data_in(&queue->connect)) {
+		/* failed in connect */
+		nvmet_tcp_finish_cmd(&queue->connect);
+	}
+}
+
+static void nvmet_tcp_release_queue_work(struct work_struct *w)
+{
+	struct nvmet_tcp_queue *queue =
+		container_of(w, struct nvmet_tcp_queue, release_work);
+
+	mutex_lock(&nvmet_tcp_queue_mutex);
+	list_del_init(&queue->queue_list);
+	mutex_unlock(&nvmet_tcp_queue_mutex);
+
+	nvmet_tcp_restore_socket_callbacks(queue);
+	flush_work(&queue->io_work);
+
+	nvmet_tcp_uninit_data_in_cmds(queue);
+	nvmet_sq_destroy(&queue->nvme_sq);
+	cancel_work_sync(&queue->io_work);
+	sock_release(queue->sock);
+	nvmet_tcp_free_cmds(queue);
+	if (queue->hdr_digest || queue->data_digest)
+		nvmet_tcp_free_crypto(queue);
+	ida_simple_remove(&nvmet_tcp_queue_ida, queue->idx);
+
+	kfree(queue);
+}
+
+static void nvmet_tcp_data_ready(struct sock *sk)
+{
+	struct nvmet_tcp_queue *queue;
+
+	read_lock_bh(&sk->sk_callback_lock);
+	queue = sk->sk_user_data;
+	if (!queue)
+		goto out;
+
+	queue_work_on(queue->cpu, nvmet_tcp_wq, &queue->io_work);
+out:
+	read_unlock_bh(&sk->sk_callback_lock);
+}
+
+static void nvmet_tcp_write_space(struct sock *sk)
+{
+	struct nvmet_tcp_queue *queue;
+
+	read_lock_bh(&sk->sk_callback_lock);
+	queue = sk->sk_user_data;
+	if (!queue)
+		goto out;
+
+	if (unlikely(queue->state == NVMET_TCP_Q_CONNECTING)) {
+		queue->old_write_space(sk);
+		goto out;
+	}
+
+	if (sk_stream_is_writeable(sk)) {
+		clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+		queue_work_on(queue->cpu, nvmet_tcp_wq, &queue->io_work);
+	}
+out:
+	read_unlock_bh(&sk->sk_callback_lock);
+}
+
+static void nvmet_tcp_state_change(struct sock *sk)
+{
+	struct nvmet_tcp_queue *queue;
+
+	write_lock_bh(&sk->sk_callback_lock);
+	queue = sk->sk_user_data;
+	if (!queue)
+		goto done;
+
+	switch (sk->sk_state) {
+	case TCP_FIN_WAIT1:
+	case TCP_CLOSE_WAIT:
+	case TCP_CLOSE:
+		/* FALLTHRU */
+		sk->sk_user_data = NULL;
+		nvmet_tcp_schedule_release_queue(queue);
+		break;
+	default:
+		pr_warn("queue %d unhandled state %d\n", queue->idx, sk->sk_state);
+	}
+done:
+	write_unlock_bh(&sk->sk_callback_lock);
+}
+
+static int nvmet_tcp_set_queue_sock(struct nvmet_tcp_queue *queue)
+{
+	struct socket *sock = queue->sock;
+	struct linger sol = { .l_onoff = 1, .l_linger = 0 };
+	int ret;
+
+	ret = kernel_getsockname(sock,
+		(struct sockaddr *)&queue->sockaddr);
+	if (ret < 0)
+		return ret;
+
+	ret = kernel_getpeername(sock,
+		(struct sockaddr *)&queue->sockaddr_peer);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * Cleanup whatever is sitting in the TCP transmit queue on socket
+	 * close. This is done to prevent stale data from being sent should
+	 * the network connection be restored before TCP times out.
+	 */
+	ret = kernel_setsockopt(sock, SOL_SOCKET, SO_LINGER,
+			(char *)&sol, sizeof(sol));
+	if (ret)
+		return ret;
+
+	write_lock_bh(&sock->sk->sk_callback_lock);
+	sock->sk->sk_user_data = queue;
+	queue->old_data_ready = sock->sk->sk_data_ready;
+	sock->sk->sk_data_ready = nvmet_tcp_data_ready;
+	queue->old_state_change = sock->sk->sk_state_change;
+	sock->sk->sk_state_change = nvmet_tcp_state_change;
+	queue->old_write_space = sock->sk->sk_write_space;
+	sock->sk->sk_write_space = nvmet_tcp_write_space;
+	write_unlock_bh(&sock->sk->sk_callback_lock);
+
+	return 0;
+}
+
+static int nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
+		struct socket *newsock)
+{
+	struct nvmet_tcp_queue *queue;
+	int ret;
+
+	queue = kzalloc(sizeof(*queue), GFP_KERNEL);
+	if (!queue)
+		return -ENOMEM;
+
+	INIT_WORK(&queue->release_work, nvmet_tcp_release_queue_work);
+	INIT_WORK(&queue->io_work, nvmet_tcp_io_work);
+	queue->sock = newsock;
+	queue->port = port;
+	queue->nr_cmds = 0;
+	spin_lock_init(&queue->state_lock);
+	queue->state = NVMET_TCP_Q_CONNECTING;
+	INIT_LIST_HEAD(&queue->free_list);
+	init_llist_head(&queue->resp_list);
+	INIT_LIST_HEAD(&queue->resp_send_list);
+
+	queue->idx = ida_simple_get(&nvmet_tcp_queue_ida, 0, 0, GFP_KERNEL);
+	if (queue->idx < 0) {
+		ret = queue->idx;
+		goto out_free_queue;
+	}
+
+	ret = nvmet_tcp_alloc_cmd(queue, &queue->connect);
+	if (ret)
+		goto out_ida_remove;
+
+	ret = nvmet_sq_init(&queue->nvme_sq);
+	if (ret)
+		goto out_ida_remove;
+
+	port->last_cpu = cpumask_next_wrap(port->last_cpu,
+				cpu_online_mask, -1, false);
+	queue->cpu = port->last_cpu;
+	nvmet_prepare_receive_pdu(queue);
+
+	mutex_lock(&nvmet_tcp_queue_mutex);
+	list_add_tail(&queue->queue_list, &nvmet_tcp_queue_list);
+	mutex_unlock(&nvmet_tcp_queue_mutex);
+
+	ret = nvmet_tcp_set_queue_sock(queue);
+	if (ret)
+		goto out_destroy_sq;
+
+	queue_work_on(queue->cpu, nvmet_tcp_wq, &queue->io_work);
+
+	return 0;
+out_destroy_sq:
+	mutex_lock(&nvmet_tcp_queue_mutex);
+	list_del_init(&queue->queue_list);
+	mutex_unlock(&nvmet_tcp_queue_mutex);
+	nvmet_sq_destroy(&queue->nvme_sq);
+out_ida_remove:
+	ida_simple_remove(&nvmet_tcp_queue_ida, queue->idx);
+out_free_queue:
+	kfree(queue);
+	return ret;
+}
+
+static void nvmet_tcp_accept_work(struct work_struct *w)
+{
+	struct nvmet_tcp_port *port =
+		container_of(w, struct nvmet_tcp_port, accept_work);
+	struct socket *newsock;
+	int ret;
+
+	while (true) {
+		ret = kernel_accept(port->sock, &newsock, O_NONBLOCK);
+		if (ret < 0) {
+			if (ret != -EAGAIN)
+				pr_warn("failed to accept err=%d\n", ret);
+			return;
+		}
+		ret = nvmet_tcp_alloc_queue(port, newsock);
+		if (ret) {
+			pr_err("failed to allocate queue\n");
+			sock_release(newsock);
+		}
+	}
+}
+
+static void nvmet_tcp_listen_data_ready(struct sock *sk)
+{
+	struct nvmet_tcp_port *port;
+
+	read_lock_bh(&sk->sk_callback_lock);
+	port = sk->sk_user_data;
+	if (!port)
+		goto out;
+
+	if (sk->sk_state == TCP_LISTEN)
+		schedule_work(&port->accept_work);
+out:
+	read_unlock_bh(&sk->sk_callback_lock);
+}
+
+static int nvmet_tcp_add_port(struct nvmet_port *nport)
+{
+	struct nvmet_tcp_port *port;
+	__kernel_sa_family_t af;
+	int opt, ret;
+
+	port = kzalloc(sizeof(*port), GFP_KERNEL);
+	if (!port)
+		return -ENOMEM;
+
+	switch (nport->disc_addr.adrfam) {
+	case NVMF_ADDR_FAMILY_IP4:
+		af = AF_INET;
+		break;
+	case NVMF_ADDR_FAMILY_IP6:
+		af = AF_INET6;
+		break;
+	default:
+		pr_err("address family %d not supported\n",
+				nport->disc_addr.adrfam);
+		ret = -EINVAL;
+		goto err_port;
+	}
+
+	ret = inet_pton_with_scope(&init_net, af, nport->disc_addr.traddr,
+			nport->disc_addr.trsvcid, &port->addr);
+	if (ret) {
+		pr_err("malformed ip/port passed: %s:%s\n",
+			nport->disc_addr.traddr, nport->disc_addr.trsvcid);
+		goto err_port;
+	}
+
+	port->nport = nport;
+	port->last_cpu = -1;
+	INIT_WORK(&port->accept_work, nvmet_tcp_accept_work);
+	if (port->nport->inline_data_size < 0)
+		port->nport->inline_data_size = NVMET_TCP_DEF_INLINE_DATA_SIZE;
+
+	ret = sock_create(port->addr.ss_family, SOCK_STREAM,
+				IPPROTO_TCP, &port->sock);
+	if (ret) {
+		pr_err("failed to create a socket\n");
+		goto err_port;
+	}
+
+	port->sock->sk->sk_user_data = port;
+	port->old_data_ready = port->sock->sk->sk_data_ready;
+	port->sock->sk->sk_data_ready = nvmet_tcp_listen_data_ready;
+
+	opt = 1;
+	ret = kernel_setsockopt(port->sock, IPPROTO_TCP,
+			TCP_NODELAY, (char *)&opt, sizeof(opt));
+	if (ret) {
+		pr_err("failed to set TCP_NODELAY sock opt %d\n", ret);
+		goto err_sock;
+	}
+
+	ret = kernel_setsockopt(port->sock, SOL_SOCKET, SO_REUSEADDR,
+			(char *)&opt, sizeof(opt));
+	if (ret) {
+		pr_err("failed to set SO_REUSEADDR sock opt %d\n", ret);
+		goto err_sock;
+	}
+
+	ret = kernel_bind(port->sock, (struct sockaddr *)&port->addr,
+			sizeof(port->addr));
+	if (ret) {
+		pr_err("failed to bind port socket %d\n", ret);
+		goto err_sock;
+	}
+
+	ret = kernel_listen(port->sock, 128);
+	if (ret) {
+		pr_err("failed to listen %d on port sock\n", ret);
+		goto err_sock;
+	}
+
+	nport->priv = port;
+	pr_info("enabling port %d (%pISpc)\n",
+		le16_to_cpu(nport->disc_addr.portid), &port->addr);
+
+	return 0;
+
+err_sock:
+	sock_release(port->sock);
+err_port:
+	kfree(port);
+	return ret;
+}
+
+static void nvmet_tcp_remove_port(struct nvmet_port *nport)
+{
+	struct nvmet_tcp_port *port = nport->priv;
+
+	write_lock_bh(&port->sock->sk->sk_callback_lock);
+	port->sock->sk->sk_data_ready = port->old_data_ready;
+	port->sock->sk->sk_user_data = NULL;
+	write_unlock_bh(&port->sock->sk->sk_callback_lock);
+	cancel_work_sync(&port->accept_work);
+
+	sock_release(port->sock);
+	kfree(port);
+}
+
+static void nvmet_tcp_delete_ctrl(struct nvmet_ctrl *ctrl)
+{
+	struct nvmet_tcp_queue *queue;
+
+	mutex_lock(&nvmet_tcp_queue_mutex);
+	list_for_each_entry(queue, &nvmet_tcp_queue_list, queue_list)
+		if (queue->nvme_sq.ctrl == ctrl)
+			kernel_sock_shutdown(queue->sock, SHUT_RDWR);
+	mutex_unlock(&nvmet_tcp_queue_mutex);
+}
+
+static u16 nvmet_tcp_install_queue(struct nvmet_sq *sq)
+{
+	struct nvmet_tcp_queue *queue =
+		container_of(sq, struct nvmet_tcp_queue, nvme_sq);
+	int ret;
+
+	if (sq->qid == 0) {
+		/* Let inflight controller teardown complete */
+		flush_scheduled_work();
+	}
+
+	queue->nr_cmds = sq->size * 2;
+	if (nvmet_tcp_alloc_cmds(queue))
+		return NVME_SC_INTERNAL;
+	return 0;
+}
+
+static void nvmet_tcp_disc_port_addr(struct nvmet_req *req,
+		struct nvmet_port *nport, char *traddr)
+{
+	struct nvmet_tcp_port *port = nport->priv;
+
+	if (inet_addr_is_any((struct sockaddr *)&port->addr)) {
+		struct nvmet_tcp_cmd *cmd =
+			container_of(req, struct nvmet_tcp_cmd, req);
+		struct nvmet_tcp_queue *queue = cmd->queue;
+
+		sprintf(traddr, "%pISc", (struct sockaddr *)&queue->sockaddr);
+	} else {
+		memcpy(traddr, nport->disc_addr.traddr, NVMF_TRADDR_SIZE);
+	}
+}
+
+static struct nvmet_fabrics_ops nvmet_tcp_ops = {
+	.owner			= THIS_MODULE,
+	.type			= NVMF_TRTYPE_TCP,
+	.msdbd			= 1,
+	.has_keyed_sgls		= 0,
+	.add_port		= nvmet_tcp_add_port,
+	.remove_port		= nvmet_tcp_remove_port,
+	.queue_response		= nvmet_tcp_queue_response,
+	.delete_ctrl		= nvmet_tcp_delete_ctrl,
+	.install_queue		= nvmet_tcp_install_queue,
+	.disc_traddr		= nvmet_tcp_disc_port_addr,
+};
+
+static int __init nvmet_tcp_init(void)
+{
+	int ret;
+
+	nvmet_tcp_wq = alloc_workqueue("nvmet_tcp_wq", WQ_HIGHPRI, 0);
+	if (!nvmet_tcp_wq)
+		return -ENOMEM;
+
+	ret = nvmet_register_transport(&nvmet_tcp_ops);
+	if (ret)
+		goto err;
+
+	return 0;
+err:
+	destroy_workqueue(nvmet_tcp_wq);
+	return ret;
+}
+
+static void __exit nvmet_tcp_exit(void)
+{
+	struct nvmet_tcp_queue *queue;
+
+	nvmet_unregister_transport(&nvmet_tcp_ops);
+
+	flush_scheduled_work();
+	mutex_lock(&nvmet_tcp_queue_mutex);
+	list_for_each_entry(queue, &nvmet_tcp_queue_list, queue_list)
+		kernel_sock_shutdown(queue->sock, SHUT_RDWR);
+	mutex_unlock(&nvmet_tcp_queue_mutex);
+	flush_scheduled_work();
+
+	destroy_workqueue(nvmet_tcp_wq);
+}
+
+module_init(nvmet_tcp_init);
+module_exit(nvmet_tcp_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS("nvmet-transport-3"); /* 3 == NVMF_TRTYPE_TCP */
diff --git a/include/linux/nvme-tcp.h b/include/linux/nvme-tcp.h
index 33c8afaf63bd..685780d1ed04 100644
--- a/include/linux/nvme-tcp.h
+++ b/include/linux/nvme-tcp.h
@@ -11,6 +11,7 @@
 
 #define NVME_TCP_DISC_PORT	8009
 #define NVME_TCP_ADMIN_CCSZ	SZ_8K
+#define NVME_TCP_DIGEST_LENGTH	4
 
 enum nvme_tcp_pfv {
 	NVME_TCP_PFV_1_0 = 0x0,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v2 13/14] nvmet-tcp: add NVMe over TCP target driver
@ 2018-11-20  3:00   ` Sagi Grimberg
  0 siblings, 0 replies; 76+ messages in thread
From: Sagi Grimberg @ 2018-11-20  3:00 UTC (permalink / raw)


From: Sagi Grimberg <sagi@lightbitslabs.com>

This patch implements the TCP transport driver for the NVMe over Fabrics
target stack. This allows exporting NVMe over Fabrics functionality over
good old TCP/IP.

The driver implements TP 8000, which defines how NVMe over Fabrics
capsules and data are encapsulated in NVMe/TCP PDUs and exchanged on top
of a TCP byte stream. NVMe/TCP header and data digests are supported as
well.
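
For illustration, a minimal usage sketch (not part of the driver code
itself): exporting a namespace through the standard nvmet configfs
interface and connecting to it with an NVMe/TCP-capable nvme-cli. The
NQN, backing device and address below are made-up placeholders:

  # target side, assuming the nvmet and nvmet-tcp modules are loaded
  mkdir /sys/kernel/config/nvmet/subsystems/testnqn
  echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/attr_allow_any_host
  mkdir /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1
  echo -n /dev/nvme0n1 > \
      /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/device_path
  echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/enable
  mkdir /sys/kernel/config/nvmet/ports/1
  echo tcp > /sys/kernel/config/nvmet/ports/1/addr_trtype
  echo ipv4 > /sys/kernel/config/nvmet/ports/1/addr_adrfam
  echo 192.168.0.1 > /sys/kernel/config/nvmet/ports/1/addr_traddr
  echo 4420 > /sys/kernel/config/nvmet/ports/1/addr_trsvcid
  ln -s /sys/kernel/config/nvmet/subsystems/testnqn \
        /sys/kernel/config/nvmet/ports/1/subsystems/testnqn

  # host side
  nvme connect -t tcp -a 192.168.0.1 -s 4420 -n testnqn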

Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
Signed-off-by: Solganik Alexander <sashas@lightbitslabs.com>
---
 drivers/nvme/target/Kconfig  |   10 +
 drivers/nvme/target/Makefile |    2 +
 drivers/nvme/target/tcp.c    | 1741 ++++++++++++++++++++++++++++++++++
 include/linux/nvme-tcp.h     |    1 +
 4 files changed, 1754 insertions(+)
 create mode 100644 drivers/nvme/target/tcp.c

diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig
index 3c7b61ddb0d1..d94f25cde019 100644
--- a/drivers/nvme/target/Kconfig
+++ b/drivers/nvme/target/Kconfig
@@ -60,3 +60,13 @@ config NVME_TARGET_FCLOOP
 	  to test NVMe-FC transport interfaces.
 
 	  If unsure, say N.
+
+config NVME_TARGET_TCP
+	tristate "NVMe over Fabrics TCP target support"
+	depends on INET
+	depends on NVME_TARGET
+	help
+	  This enables the NVMe TCP target support, which allows exporting NVMe
+	  devices over TCP.
+
+	  If unsure, say N.
diff --git a/drivers/nvme/target/Makefile b/drivers/nvme/target/Makefile
index 8118c93391c6..8c3ad0fb6860 100644
--- a/drivers/nvme/target/Makefile
+++ b/drivers/nvme/target/Makefile
@@ -5,6 +5,7 @@ obj-$(CONFIG_NVME_TARGET_LOOP)		+= nvme-loop.o
 obj-$(CONFIG_NVME_TARGET_RDMA)		+= nvmet-rdma.o
 obj-$(CONFIG_NVME_TARGET_FC)		+= nvmet-fc.o
 obj-$(CONFIG_NVME_TARGET_FCLOOP)	+= nvme-fcloop.o
+obj-$(CONFIG_NVME_TARGET_TCP)		+= nvmet-tcp.o
 
 nvmet-y		+= core.o configfs.o admin-cmd.o fabrics-cmd.o \
 			discovery.o io-cmd-file.o io-cmd-bdev.o
@@ -12,3 +13,4 @@ nvme-loop-y	+= loop.o
 nvmet-rdma-y	+= rdma.o
 nvmet-fc-y	+= fc.o
 nvme-fcloop-y	+= fcloop.o
+nvmet-tcp-y	+= tcp.o
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
new file mode 100644
index 000000000000..61311e518072
--- /dev/null
+++ b/drivers/nvme/target/tcp.c
@@ -0,0 +1,1741 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * NVMe over Fabrics TCP target.
+ * Copyright (c) 2018 LightBits Labs. All rights reserved.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/nvme-tcp.h>
+#include <net/sock.h>
+#include <net/tcp.h>
+#include <linux/inet.h>
+#include <linux/llist.h>
+#include <crypto/hash.h>
+
+#include "nvmet.h"
+
+#define NVMET_TCP_DEF_INLINE_DATA_SIZE	(4 * PAGE_SIZE)
+
+#define NVMET_TCP_RECV_BUDGET		8
+#define NVMET_TCP_SEND_BUDGET		8
+#define NVMET_TCP_IO_WORK_BUDGET	64
+
+enum nvmet_tcp_send_state {
+	NVMET_TCP_SEND_DATA_PDU = 0,
+	NVMET_TCP_SEND_DATA,
+	NVMET_TCP_SEND_R2T,
+	NVMET_TCP_SEND_DDGST,
+	NVMET_TCP_SEND_RESPONSE
+};
+
+enum nvmet_tcp_recv_state {
+	NVMET_TCP_RECV_PDU,
+	NVMET_TCP_RECV_DATA,
+	NVMET_TCP_RECV_DDGST,
+	NVMET_TCP_RECV_ERR,
+};
+
+struct nvmet_tcp_send_ctx {
+	u32			offset;
+	struct scatterlist	*cur_sg;
+	enum nvmet_tcp_send_state state;
+};
+
+enum {
+	NVMET_TCP_F_INIT_FAILED = (1 << 0),
+};
+
+struct nvmet_tcp_cmd {
+	struct nvmet_tcp_queue		*queue;
+	struct nvmet_req		req;
+
+	struct nvme_tcp_cmd_pdu		*cmd_pdu;
+	struct nvme_tcp_rsp_pdu		*rsp_pdu;
+	struct nvme_tcp_data_pdu	*data_pdu;
+	struct nvme_tcp_r2t_pdu		*r2t_pdu;
+
+	u32				rbytes_done;
+	u32				wbytes_done;
+
+	u32				pdu_len;
+	u32				pdu_recv;
+	int				sg_idx;
+	int				nr_mapped;
+	struct msghdr			recv_msg;
+	struct kvec			*iov;
+	u32				flags;
+
+	struct list_head		entry;
+	struct llist_node		lentry;
+	struct nvmet_tcp_send_ctx	snd;
+	__le32				exp_ddgst;
+	__le32				recv_ddgst;
+};
+
+enum nvmet_tcp_queue_state {
+	NVMET_TCP_Q_CONNECTING,
+	NVMET_TCP_Q_LIVE,
+	NVMET_TCP_Q_DISCONNECTING,
+};
+
+struct nvmet_tcp_recv_ctx {
+	union nvme_tcp_pdu		pdu;
+	int				offset;
+	int				left;
+	enum nvmet_tcp_recv_state	state;
+	struct nvmet_tcp_cmd		*cmd;
+};
+
+struct nvmet_tcp_queue {
+	struct socket		*sock;
+	struct nvmet_tcp_port	*port;
+
+	struct nvmet_tcp_cmd	*cmds;
+	unsigned		nr_cmds;
+	struct list_head	free_list;
+	struct llist_head	resp_list;
+	struct list_head	resp_send_list;
+	int			send_list_len;
+
+	spinlock_t		state_lock;
+	enum nvmet_tcp_queue_state state;
+	struct nvmet_cq		nvme_cq;
+	struct nvmet_sq		nvme_sq;
+
+	struct sockaddr_storage	sockaddr;
+	struct sockaddr_storage	sockaddr_peer;
+	struct work_struct	release_work;
+	struct work_struct	io_work;
+
+	int			idx;
+	int			cpu;
+
+	struct list_head	queue_list;
+	struct nvmet_tcp_cmd	*snd_cmd;
+	struct nvmet_tcp_recv_ctx rcv;
+
+	bool			hdr_digest;
+	bool			data_digest;
+	struct ahash_request	*snd_hash;
+	struct ahash_request	*rcv_hash;
+
+	struct nvmet_tcp_cmd	connect;
+
+	struct page_frag_cache	pf_cache;
+
+	void (*old_data_ready)(struct sock *);
+	void (*old_state_change)(struct sock *);
+	void (*old_write_space)(struct sock *);
+};
+
+struct nvmet_tcp_port {
+	struct socket		*sock;
+	struct work_struct	accept_work;
+	struct nvmet_port	*nport;
+	struct sockaddr_storage addr;
+	int			last_cpu;
+	void (*old_data_ready) (struct sock *);
+};
+
+static DEFINE_IDA(nvmet_tcp_queue_ida);
+static LIST_HEAD(nvmet_tcp_queue_list);
+static DEFINE_MUTEX(nvmet_tcp_queue_mutex);
+
+static struct workqueue_struct *nvmet_tcp_wq;
+static struct nvmet_fabrics_ops nvmet_tcp_ops;
+static void nvmet_tcp_free_cmd(struct nvmet_tcp_cmd *c);
+static void nvmet_tcp_finish_cmd(struct nvmet_tcp_cmd *cmd);
+
+static inline u16 nvmet_tcp_cmd_id(struct nvmet_tcp_queue *queue,
+		struct nvmet_tcp_cmd *cmd)
+{
+	return cmd - queue->cmds;
+}
+
+static inline bool nvmet_tcp_has_data_in(struct nvmet_tcp_cmd *cmd)
+{
+	return nvme_is_write(cmd->req.cmd) &&
+		cmd->rbytes_done < cmd->req.transfer_len;
+}
+
+static inline bool nvmet_tcp_need_data_in(struct nvmet_tcp_cmd *cmd)
+{
+	return nvmet_tcp_has_data_in(cmd) && !cmd->req.rsp->status;
+}
+
+static inline bool nvmet_tcp_need_data_out(struct nvmet_tcp_cmd *cmd)
+{
+	return !nvme_is_write(cmd->req.cmd) &&
+		cmd->req.transfer_len > 0 &&
+		!cmd->req.rsp->status;
+}
+
+static inline bool nvmet_tcp_has_inline_data(struct nvmet_tcp_cmd *cmd)
+{
+	return nvme_is_write(cmd->req.cmd) && cmd->pdu_len &&
+		!cmd->rbytes_done;
+}
+
+static inline struct nvmet_tcp_cmd *
+nvmet_tcp_get_cmd(struct nvmet_tcp_queue *queue)
+{
+	struct nvmet_tcp_cmd *cmd;
+
+	cmd = list_first_entry_or_null(&queue->free_list,
+				struct nvmet_tcp_cmd, entry);
+	if (!cmd)
+		return NULL;
+	list_del_init(&cmd->entry);
+
+	cmd->rbytes_done = cmd->wbytes_done = 0;
+	cmd->pdu_len = 0;
+	cmd->pdu_recv = 0;
+	cmd->iov = NULL;
+	cmd->flags = 0;
+	return cmd;
+}
+
+static inline void nvmet_tcp_put_cmd(struct nvmet_tcp_cmd *cmd)
+{
+	if (unlikely(cmd == &cmd->queue->connect))
+		return;
+
+	list_add_tail(&cmd->entry, &cmd->queue->free_list);
+}
+
+static inline u8 nvmet_tcp_hdgst_len(struct nvmet_tcp_queue *queue)
+{
+	return queue->hdr_digest ? NVME_TCP_DIGEST_LENGTH : 0;
+}
+
+static inline u8 nvmet_tcp_ddgst_len(struct nvmet_tcp_queue *queue)
+{
+	return queue->data_digest ? NVME_TCP_DIGEST_LENGTH : 0;
+}
+
+static inline void nvmet_tcp_hdgst(struct ahash_request *hash,
+		void *pdu, size_t len)
+{
+	struct scatterlist sg;
+
+	sg_init_one(&sg, pdu, len);
+	ahash_request_set_crypt(hash, &sg, pdu + len, len);
+	crypto_ahash_digest(hash);
+}
+
+static int nvmet_tcp_verify_hdgst(struct nvmet_tcp_queue *queue,
+	void *pdu, size_t len)
+{
+	struct nvme_tcp_hdr *hdr = pdu;
+	__le32 recv_digest;
+	__le32 exp_digest;
+
+	if (unlikely(!(hdr->flags & NVME_TCP_F_HDGST))) {
+		pr_err("queue %d: header digest enabled but pdu without digest\n",
+			queue->idx);
+		return -EPROTO;
+	}
+
+	recv_digest = *(__le32 *)(pdu + hdr->hlen);
+	nvmet_tcp_hdgst(queue->rcv_hash, pdu, len);
+	exp_digest = *(__le32 *)(pdu + hdr->hlen);
+	if (recv_digest != exp_digest) {
+		pr_err("queue %d: header digest error: recv %#x expected %#x\n",
+			queue->idx, le32_to_cpu(recv_digest),
+			le32_to_cpu(exp_digest));
+		return -EPROTO;
+	}
+
+	return 0;
+}
+
+static int nvmet_tcp_check_ddgst(struct nvmet_tcp_queue *queue, void *pdu)
+{
+	struct nvme_tcp_hdr *hdr = pdu;
+	u32 len;
+
+	len = le32_to_cpu(hdr->plen) - hdr->hlen -
+		(hdr->flags & NVME_TCP_F_HDGST ? nvmet_tcp_hdgst_len(queue) : 0);
+
+	if (unlikely(len && !(hdr->flags & NVME_TCP_F_DDGST))) {
+		pr_err("queue %d: data digest flag is cleared\n", queue->idx);
+		return -EPROTO;
+	}
+
+	return 0;
+}
+
+static void nvmet_tcp_unmap_pdu_iovec(struct nvmet_tcp_cmd *cmd)
+{
+	struct scatterlist *sg;
+	int i;
+
+	sg = &cmd->req.sg[cmd->sg_idx];
+
+	for (i = 0; i < cmd->nr_mapped; i++)
+		kunmap(sg_page(&sg[i]));
+}
+
+static void nvmet_tcp_map_pdu_iovec(struct nvmet_tcp_cmd *cmd)
+{
+	struct kvec *iov = cmd->iov;
+	struct scatterlist *sg;
+	u32 length, offset, sg_offset;
+
+	length = cmd->pdu_len;
+	cmd->nr_mapped = DIV_ROUND_UP(length, PAGE_SIZE);
+	offset = cmd->rbytes_done;
+	cmd->sg_idx = DIV_ROUND_UP(offset, PAGE_SIZE);
+	sg_offset = offset % PAGE_SIZE;
+	sg = &cmd->req.sg[cmd->sg_idx];
+
+	while (length) {
+		u32 iov_len = min_t(u32, length, sg->length - sg_offset);
+
+		iov->iov_base = kmap(sg_page(sg)) + sg->offset + sg_offset;
+		iov->iov_len = iov_len;
+
+		length -= iov_len;
+		sg = sg_next(sg);
+		iov++;
+	}
+
+	iov_iter_kvec(&cmd->recv_msg.msg_iter, READ, cmd->iov,
+		cmd->nr_mapped, cmd->pdu_len);
+}
+
+static void nvmet_tcp_fatal_error(struct nvmet_tcp_queue *queue)
+{
+	queue->rcv.state = NVMET_TCP_RECV_ERR;
+	if (queue->nvme_sq.ctrl)
+		nvmet_ctrl_fatal_error(queue->nvme_sq.ctrl);
+	else
+		kernel_sock_shutdown(queue->sock, SHUT_RDWR);
+}
+
+static int nvmet_tcp_map_data(struct nvmet_tcp_cmd *cmd)
+{
+	struct nvme_sgl_desc *sgl = &cmd->req.cmd->common.dptr.sgl;
+	u32 len = le32_to_cpu(sgl->length);
+
+	if (!cmd->req.data_len)
+		return 0;
+
+	if (sgl->type == ((NVME_SGL_FMT_DATA_DESC << 4) |
+			  NVME_SGL_FMT_OFFSET)) {
+		if (!nvme_is_write(cmd->req.cmd))
+			return NVME_SC_INVALID_FIELD | NVME_SC_DNR;
+
+		if (len > cmd->req.port->inline_data_size)
+			return NVME_SC_SGL_INVALID_OFFSET | NVME_SC_DNR;
+		cmd->pdu_len = len;
+	}
+	cmd->req.transfer_len += len;
+
+	cmd->req.sg = sgl_alloc(len, GFP_KERNEL, &cmd->req.sg_cnt);
+	if (!cmd->req.sg)
+		return NVME_SC_INTERNAL;
+	cmd->snd.cur_sg = cmd->req.sg;
+
+	if (nvmet_tcp_has_data_in(cmd)) {
+		cmd->iov = kmalloc_array(cmd->req.sg_cnt,
+				sizeof(*cmd->iov), GFP_KERNEL);
+		if (!cmd->iov)
+			goto err;
+	}
+
+	return 0;
+err:
+	sgl_free(cmd->req.sg);
+	return NVME_SC_INTERNAL;
+}
+
+static void nvmet_tcp_ddgst(struct ahash_request *hash,
+		struct nvmet_tcp_cmd *cmd)
+{
+	ahash_request_set_crypt(hash, cmd->req.sg,
+		(void *)&cmd->exp_ddgst, cmd->req.transfer_len);
+	crypto_ahash_digest(hash);
+}
+
+static void nvmet_setup_c2h_data_pdu(struct nvmet_tcp_cmd *cmd)
+{
+	struct nvme_tcp_data_pdu *pdu = cmd->data_pdu;
+	struct nvmet_tcp_queue *queue = cmd->queue;
+	u8 hdgst = nvmet_tcp_hdgst_len(cmd->queue);
+	u8 ddgst = nvmet_tcp_ddgst_len(cmd->queue);
+
+	cmd->snd.offset = 0;
+	cmd->snd.state = NVMET_TCP_SEND_DATA_PDU;
+
+	pdu->hdr.type = nvme_tcp_c2h_data;
+	pdu->hdr.flags = NVME_TCP_F_DATA_LAST;
+	pdu->hdr.hlen = sizeof(*pdu);
+	pdu->hdr.pdo = pdu->hdr.hlen + hdgst;
+	pdu->hdr.plen =
+		cpu_to_le32(pdu->hdr.hlen + hdgst + cmd->req.transfer_len + ddgst);
+	pdu->command_id = cmd->req.rsp->command_id;
+	pdu->data_length = cpu_to_le32(cmd->req.transfer_len);
+	pdu->data_offset = cpu_to_le32(cmd->wbytes_done);
+
+	if (queue->data_digest) {
+		pdu->hdr.flags |= NVME_TCP_F_DDGST;
+		nvmet_tcp_ddgst(queue->snd_hash, cmd);
+	}
+
+	if (cmd->queue->hdr_digest) {
+		pdu->hdr.flags |= NVME_TCP_F_HDGST;
+		nvmet_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
+	}
+}
+
+static void nvmet_setup_r2t_pdu(struct nvmet_tcp_cmd *cmd)
+{
+	struct nvme_tcp_r2t_pdu *pdu = cmd->r2t_pdu;
+	struct nvmet_tcp_queue *queue = cmd->queue;
+	u8 hdgst = nvmet_tcp_hdgst_len(cmd->queue);
+
+	cmd->snd.offset = 0;
+	cmd->snd.state = NVMET_TCP_SEND_R2T;
+
+	pdu->hdr.type = nvme_tcp_r2t;
+	pdu->hdr.flags = 0;
+	pdu->hdr.hlen = sizeof(*pdu);
+	pdu->hdr.pdo = 0;
+	pdu->hdr.plen = cpu_to_le32(pdu->hdr.hlen + hdgst);
+
+	pdu->command_id = cmd->req.cmd->common.command_id;
+	pdu->ttag = nvmet_tcp_cmd_id(cmd->queue, cmd);
+	pdu->r2t_length = cpu_to_le32(cmd->req.transfer_len - cmd->rbytes_done);
+	pdu->r2t_offset = cpu_to_le32(cmd->rbytes_done);
+	if (cmd->queue->hdr_digest) {
+		pdu->hdr.flags |= NVME_TCP_F_HDGST;
+		nvmet_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
+	}
+}
+
+static void nvmet_setup_response_pdu(struct nvmet_tcp_cmd *cmd)
+{
+	struct nvme_tcp_rsp_pdu *pdu = cmd->rsp_pdu;
+	struct nvmet_tcp_queue *queue = cmd->queue;
+	u8 hdgst = nvmet_tcp_hdgst_len(cmd->queue);
+
+	cmd->snd.offset = 0;
+	cmd->snd.state = NVMET_TCP_SEND_RESPONSE;
+
+	pdu->hdr.type = nvme_tcp_rsp;
+	pdu->hdr.flags = 0;
+	pdu->hdr.hlen = sizeof(*pdu);
+	pdu->hdr.pdo = 0;
+	pdu->hdr.plen = cpu_to_le32(pdu->hdr.hlen + hdgst);
+	if (cmd->queue->hdr_digest) {
+		pdu->hdr.flags |= NVME_TCP_F_HDGST;
+		nvmet_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
+	}
+}
+
+static struct nvmet_tcp_cmd *nvmet_tcp_reverse_list(struct nvmet_tcp_queue *queue, struct llist_node *node)
+{
+	struct nvmet_tcp_cmd *cmd;
+
+	while (node) {
+		struct nvmet_tcp_cmd *cmd = container_of(node, struct nvmet_tcp_cmd, lentry);
+
+		list_add(&cmd->entry, &queue->resp_send_list);
+		node = node->next;
+		queue->send_list_len++;
+	}
+
+	cmd = list_first_entry(&queue->resp_send_list, struct nvmet_tcp_cmd, entry);
+	return cmd;
+}
+
+static struct nvmet_tcp_cmd *nvmet_tcp_fetch_send_command(struct nvmet_tcp_queue *queue)
+{
+	struct llist_node *node;
+
+	queue->snd_cmd = list_first_entry_or_null(&queue->resp_send_list,
+				struct nvmet_tcp_cmd, entry);
+	if (!queue->snd_cmd) {
+		node = llist_del_all(&queue->resp_list);
+		if (!node)
+			return NULL;
+		queue->snd_cmd = nvmet_tcp_reverse_list(queue, node);
+	}
+
+	list_del_init(&queue->snd_cmd->entry);
+	queue->send_list_len--;
+
+	if (nvmet_tcp_need_data_out(queue->snd_cmd))
+		nvmet_setup_c2h_data_pdu(queue->snd_cmd);
+	else if (nvmet_tcp_need_data_in(queue->snd_cmd))
+		nvmet_setup_r2t_pdu(queue->snd_cmd);
+	else
+		nvmet_setup_response_pdu(queue->snd_cmd);
+
+	return queue->snd_cmd;
+}
+
+static void nvmet_tcp_queue_response(struct nvmet_req *req)
+{
+	struct nvmet_tcp_cmd *cmd =
+		container_of(req, struct nvmet_tcp_cmd, req);
+	struct nvmet_tcp_queue	*queue = cmd->queue;
+
+	llist_add(&cmd->lentry, &queue->resp_list);
+	queue_work_on(cmd->queue->cpu, nvmet_tcp_wq, &cmd->queue->io_work);
+}
+
+static int nvmet_try_send_data_pdu(struct nvmet_tcp_cmd *cmd)
+{
+	struct nvmet_tcp_send_ctx *snd = &cmd->snd;
+	u8 hdgst = nvmet_tcp_hdgst_len(cmd->queue);
+	int left = sizeof(*cmd->data_pdu) - snd->offset + hdgst;
+	int ret;
+
+	ret = kernel_sendpage(cmd->queue->sock, virt_to_page(cmd->data_pdu),
+			offset_in_page(cmd->data_pdu) + snd->offset,
+			left, MSG_DONTWAIT | MSG_MORE);
+	if (ret <= 0)
+		return ret;
+
+	snd->offset += ret;
+	left -= ret;
+
+	if (left)
+		return -EAGAIN;
+
+	snd->state = NVMET_TCP_SEND_DATA;
+	snd->offset = 0;
+	return 1;
+}
+
+static int nvmet_try_send_data(struct nvmet_tcp_cmd *cmd)
+{
+	struct nvmet_tcp_send_ctx *snd = &cmd->snd;
+	struct nvmet_tcp_queue *queue = cmd->queue;
+	int ret;
+
+	while (snd->cur_sg) {
+		struct page *page = sg_page(snd->cur_sg);
+		u32 left = snd->cur_sg->length - snd->offset;
+
+		ret = kernel_sendpage(cmd->queue->sock, page, snd->offset,
+					left, MSG_DONTWAIT | MSG_MORE);
+		if (ret <= 0)
+			return ret;
+
+		snd->offset += ret;
+		cmd->wbytes_done += ret;
+
+		/* Done with sg? */
+		if (snd->offset == snd->cur_sg->length) {
+			snd->cur_sg = sg_next(snd->cur_sg);
+			snd->offset = 0;
+		}
+	}
+
+	if (queue->data_digest) {
+		cmd->snd.state = NVMET_TCP_SEND_DDGST;
+		snd->offset = 0;
+	} else {
+		nvmet_setup_response_pdu(cmd);
+	}
+	return 1;
+
+}
+
+static int nvmet_try_send_response(struct nvmet_tcp_cmd *cmd, bool last_in_batch)
+{
+	struct nvmet_tcp_send_ctx *snd = &cmd->snd;
+	u8 hdgst = nvmet_tcp_hdgst_len(cmd->queue);
+	int left = sizeof(*cmd->rsp_pdu) - snd->offset + hdgst;
+	int flags = MSG_DONTWAIT;
+	int ret;
+
+	if (!last_in_batch && cmd->queue->send_list_len)
+		flags |= MSG_MORE;
+	else
+		flags |= MSG_EOR;
+
+	ret = kernel_sendpage(cmd->queue->sock, virt_to_page(cmd->rsp_pdu),
+			offset_in_page(cmd->rsp_pdu) + snd->offset, left, flags);
+	if (ret <= 0)
+		return ret;
+	snd->offset += ret;
+	left -= ret;
+
+	if (left)
+		return -EAGAIN;
+
+	kfree(cmd->iov);
+	sgl_free(cmd->req.sg);
+	cmd->queue->snd_cmd = NULL;
+	nvmet_tcp_put_cmd(cmd);
+	return 1;
+}
+
+static int nvmet_try_send_r2t(struct nvmet_tcp_cmd *cmd, bool last_in_batch)
+{
+	struct nvmet_tcp_send_ctx *snd = &cmd->snd;
+	u8 hdgst = nvmet_tcp_hdgst_len(cmd->queue);
+	int left = sizeof(*cmd->r2t_pdu) - snd->offset + hdgst;
+	int flags = MSG_DONTWAIT;
+	int ret;
+
+	if (!last_in_batch && cmd->queue->send_list_len)
+		flags |= MSG_MORE;
+	else
+		flags |= MSG_EOR;
+
+	ret = kernel_sendpage(cmd->queue->sock, virt_to_page(cmd->r2t_pdu),
+			offset_in_page(cmd->r2t_pdu) + snd->offset, left, flags);
+	if (ret <= 0)
+		return ret;
+	snd->offset += ret;
+	left -= ret;
+
+	if (left)
+		return -EAGAIN;
+
+	cmd->queue->snd_cmd = NULL;
+	return 1;
+}
+
+static int nvmet_try_send_ddgst(struct nvmet_tcp_cmd *cmd)
+{
+	struct nvmet_tcp_queue *queue = cmd->queue;
+	struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
+	struct kvec iov = {
+		.iov_base = &cmd->exp_ddgst + cmd->snd.offset,
+		.iov_len = NVME_TCP_DIGEST_LENGTH - cmd->snd.offset
+	};
+	int ret;
+
+	ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
+	if (unlikely(ret <= 0))
+		return ret;
+
+	cmd->snd.offset += ret;
+	nvmet_setup_response_pdu(cmd);
+	return 1;
+}
+
+static int nvmet_tcp_try_send_one(struct nvmet_tcp_queue *queue,
+		bool last_in_batch)
+{
+	struct nvmet_tcp_cmd *cmd = queue->snd_cmd;
+	int ret = 0;
+
+	if (!cmd || queue->state == NVMET_TCP_Q_DISCONNECTING) {
+		cmd = nvmet_tcp_fetch_send_command(queue);
+		if (unlikely(!cmd))
+			return 0;
+	}
+
+	if (cmd->snd.state == NVMET_TCP_SEND_DATA_PDU) {
+		ret = nvmet_try_send_data_pdu(cmd);
+		if (ret <= 0)
+			goto done_send;
+	}
+
+	if (cmd->snd.state == NVMET_TCP_SEND_DATA) {
+		ret = nvmet_try_send_data(cmd);
+		if (ret <= 0)
+			goto done_send;
+	}
+
+	if (cmd->snd.state == NVMET_TCP_SEND_DDGST) {
+		ret = nvmet_try_send_ddgst(cmd);
+		if (ret <= 0)
+			goto done_send;
+	}
+
+	if (cmd->snd.state == NVMET_TCP_SEND_R2T) {
+		ret = nvmet_try_send_r2t(cmd, last_in_batch);
+		if (ret <= 0)
+			goto done_send;
+	}
+
+	if (cmd->snd.state == NVMET_TCP_SEND_RESPONSE)
+		ret = nvmet_try_send_response(cmd, last_in_batch);
+
+done_send:
+	if (ret < 0) {
+		if (ret == -EAGAIN)
+			return 0;
+		return ret;
+	}
+
+	return 1;
+}
+
+static int nvmet_tcp_try_send(struct nvmet_tcp_queue *queue,
+		int budget, int *sends)
+{
+	int i, ret = 0;
+
+	for (i = 0; i < budget; i++) {
+		ret = nvmet_tcp_try_send_one(queue, i == budget - 1);
+		if (ret <= 0)
+			break;
+		(*sends)++;
+	}
+
+	return ret;
+}
+
+static void nvmet_prepare_receive_pdu(struct nvmet_tcp_queue *queue)
+{
+	struct nvmet_tcp_recv_ctx *rcv = &queue->rcv;
+
+	rcv->offset = 0;
+	rcv->left = sizeof(struct nvme_tcp_hdr);
+	rcv->cmd = NULL;
+	rcv->state = NVMET_TCP_RECV_PDU;
+}
+
+static void nvmet_tcp_free_crypto(struct nvmet_tcp_queue *queue)
+{
+	struct crypto_ahash *tfm = crypto_ahash_reqtfm(queue->rcv_hash);
+
+	ahash_request_free(queue->rcv_hash);
+	ahash_request_free(queue->snd_hash);
+	crypto_free_ahash(tfm);
+}
+
+static int nvmet_tcp_alloc_crypto(struct nvmet_tcp_queue *queue)
+{
+	struct crypto_ahash *tfm;
+
+	tfm = crypto_alloc_ahash("crc32c", 0, CRYPTO_ALG_ASYNC);
+	if (IS_ERR(tfm))
+		return PTR_ERR(tfm);
+
+	queue->snd_hash = ahash_request_alloc(tfm, GFP_KERNEL);
+	if (!queue->snd_hash)
+		goto free_tfm;
+	ahash_request_set_callback(queue->snd_hash, 0, NULL, NULL);
+
+	queue->rcv_hash = ahash_request_alloc(tfm, GFP_KERNEL);
+	if (!queue->rcv_hash)
+		goto free_snd_hash;
+	ahash_request_set_callback(queue->rcv_hash, 0, NULL, NULL);
+
+	return 0;
+free_snd_hash:
+	ahash_request_free(queue->snd_hash);
+free_tfm:
+	crypto_free_ahash(tfm);
+	return -ENOMEM;
+}
+
+
+static int nvmet_tcp_handle_icreq(struct nvmet_tcp_queue *queue)
+{
+	struct nvme_tcp_icreq_pdu *icreq = &queue->rcv.pdu.icreq;
+	struct nvme_tcp_icresp_pdu *icresp = &queue->rcv.pdu.icresp;
+	struct msghdr msg = {};
+	struct kvec iov;
+	int ret;
+
+	if (le32_to_cpu(icreq->hdr.plen) != sizeof(struct nvme_tcp_icreq_pdu)) {
+		pr_err("bad nvme-tcp pdu length (%d)\n",
+			le32_to_cpu(icreq->hdr.plen));
+		nvmet_tcp_fatal_error(queue);
+		return -EPROTO;
+	}
+
+	if (icreq->pfv != NVME_TCP_PFV_1_0) {
+		pr_err("queue %d: bad pfv %d\n", queue->idx, icreq->pfv);
+		return -EINVAL;
+	}
+
+	queue->hdr_digest = !!(icreq->digest & NVME_TCP_HDR_DIGEST_ENABLE);
+	queue->data_digest = !!(icreq->digest & NVME_TCP_DATA_DIGEST_ENABLE);
+	if (queue->hdr_digest || queue->data_digest) {
+		ret = nvmet_tcp_alloc_crypto(queue);
+		if (ret)
+			return ret;
+	}
+
+	if (icreq->hpda != 0) {
+		pr_err("queue %d: unsupported hpda %d\n", queue->idx,
+			icreq->hpda);
+		ret = -EPROTO;
+		goto free_crypto;
+	}
+
+	memset(icresp, 0, sizeof(*icresp));
+	icresp->hdr.type = nvme_tcp_icresp;
+	icresp->hdr.hlen = sizeof(*icresp);
+	icresp->hdr.pdo = 0;
+	icresp->hdr.plen = cpu_to_le32(icresp->hdr.hlen);
+	icresp->pfv = cpu_to_le16(NVME_TCP_PFV_1_0);
+	icresp->maxdata = 0xffff; /* FIXME: support r2t */
+	icresp->cpda = 0;
+	if (queue->hdr_digest)
+		icresp->digest |= NVME_TCP_HDR_DIGEST_ENABLE;
+	if (queue->data_digest)
+		icresp->digest |= NVME_TCP_DATA_DIGEST_ENABLE;
+
+	iov.iov_base = icresp;
+	iov.iov_len = sizeof(*icresp);
+	ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
+	if (ret < 0)
+		goto free_crypto;
+
+	queue->state = NVMET_TCP_Q_LIVE;
+	nvmet_prepare_receive_pdu(queue);
+	return 0;
+free_crypto:
+	if (queue->hdr_digest || queue->data_digest)
+		nvmet_tcp_free_crypto(queue);
+	return ret;
+}
+
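+/*
+ * Called when nvmet_req_init() fails: the command was already completed
+ * with an error, but if it is a write carrying in-capsule data we still
+ * have to absorb that data from the socket to keep the byte stream in
+ * sync.  Receive it into the command, marked NVMET_TCP_F_INIT_FAILED so
+ * it is never executed.
+ */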
+static void nvmet_tcp_handle_req_failure(struct nvmet_tcp_queue *queue,
+		struct nvmet_tcp_cmd *cmd, struct nvmet_req *req)
+{
+	int ret;
+
+	/* recover the expected data transfer length */
+	req->data_len = le32_to_cpu(req->cmd->common.dptr.sgl.length);
+
+	if (!nvme_is_write(cmd->req.cmd) ||
+	    req->data_len > cmd->req.port->inline_data_size) {
+		nvmet_prepare_receive_pdu(queue);
+		return;
+	}
+
+	ret = nvmet_tcp_map_data(cmd);
+	if (unlikely(ret)) {
+		pr_err("queue %d: failed to map data\n", queue->idx);
+		nvmet_tcp_fatal_error(queue);
+		return;
+	}
+
+	queue->rcv.state = NVMET_TCP_RECV_DATA;
+	nvmet_tcp_map_pdu_iovec(cmd);
+	cmd->flags |= NVMET_TCP_F_INIT_FAILED;
+}
+
+static int nvmet_tcp_handle_h2c_data_pdu(struct nvmet_tcp_queue *queue)
+{
+	struct nvmet_tcp_recv_ctx *rcv = &queue->rcv;
+	struct nvme_tcp_data_pdu *data = &rcv->pdu.data;
+	struct nvmet_tcp_cmd *cmd;
+
+	cmd = &queue->cmds[data->ttag];
+
+	if (le32_to_cpu(data->data_offset) != cmd->rbytes_done) {
+		pr_err("queue %d ttag %u unexpected data offset %u (expected %u)\n",
+			queue->idx, data->ttag, le32_to_cpu(data->data_offset),
+			cmd->rbytes_done);
+		/* FIXME: use path and transport errors */
+		nvmet_req_complete(&cmd->req,
+			NVME_SC_INVALID_FIELD | NVME_SC_DNR);
+		return -EPROTO;
+	}
+
+	cmd->pdu_len = le32_to_cpu(data->data_length);
+	cmd->pdu_recv = 0;
+	nvmet_tcp_map_pdu_iovec(cmd);
+	rcv->cmd = cmd;
+	rcv->state = NVMET_TCP_RECV_DATA;
+
+	return 0;
+}
+
+static int nvmet_tcp_done_recv_pdu(struct nvmet_tcp_queue *queue)
+{
+	struct nvmet_tcp_recv_ctx *rcv = &queue->rcv;
+	struct nvme_tcp_hdr *hdr = &rcv->pdu.cmd.hdr;
+	struct nvme_command *nvme_cmd = &rcv->pdu.cmd.cmd;
+	struct nvmet_req *req;
+	int ret;
+
+	if (unlikely(queue->state == NVMET_TCP_Q_CONNECTING)) {
+		if (hdr->type != nvme_tcp_icreq) {
+			pr_err("unexpected pdu type (%d) before icreq\n",
+				hdr->type);
+			nvmet_tcp_fatal_error(queue);
+			return -EPROTO;
+		}
+		return nvmet_tcp_handle_icreq(queue);
+	}
+
+	if (hdr->type == nvme_tcp_h2c_data) {
+		ret = nvmet_tcp_handle_h2c_data_pdu(queue);
+		if (unlikely(ret))
+			return ret;
+		return 0;
+	}
+
+	rcv->cmd = nvmet_tcp_get_cmd(queue);
+	if (unlikely(!rcv->cmd)) {
+		/* This should never happen */
+		pr_err("queue %d: failed to get command, nr_cmds: %d, send_list_len: %d, opcode: %d\n",
+			queue->idx, queue->nr_cmds, queue->send_list_len, nvme_cmd->common.opcode);
+		nvmet_tcp_fatal_error(queue);
+		return -ENOMEM;
+	}
+
+	req = &rcv->cmd->req;
+	memcpy(req->cmd, nvme_cmd, sizeof(*nvme_cmd));
+
+	if (unlikely(!nvmet_req_init(req, &queue->nvme_cq,
+			&queue->nvme_sq, &nvmet_tcp_ops))) {
+		pr_err("failed cmd %p id %d opcode %d, data_len: %d\n",
+			req->cmd, req->cmd->common.command_id,
+			req->cmd->common.opcode,
+			le32_to_cpu(req->cmd->common.dptr.sgl.length));
+
+		nvmet_tcp_handle_req_failure(queue, rcv->cmd, req);
+		return -EAGAIN;
+	}
+
+	ret = nvmet_tcp_map_data(rcv->cmd);
+	if (unlikely(ret)) {
+		pr_err("queue %d: failed to map data\n", queue->idx);
+		if (nvmet_tcp_has_inline_data(rcv->cmd))
+			nvmet_tcp_fatal_error(queue);
+		else
+			nvmet_req_complete(req, ret);
+		ret = -EAGAIN;
+		goto out;
+	}
+
+	if (nvmet_tcp_need_data_in(rcv->cmd)) {
+		if (nvmet_tcp_has_inline_data(rcv->cmd)) {
+			rcv->state = NVMET_TCP_RECV_DATA;
+			nvmet_tcp_map_pdu_iovec(rcv->cmd);
+			return 0;
+		} else {
+			/* send back R2T */
+			nvmet_tcp_queue_response(&rcv->cmd->req);
+			goto out;
+		}
+	}
+
+	nvmet_req_execute(&rcv->cmd->req);
+out:
+	nvmet_prepare_receive_pdu(queue);
+	return ret;
+}
+
+static const u8 nvme_tcp_pdu_sizes[] = {
+	[nvme_tcp_icreq]	= sizeof(struct nvme_tcp_icreq_pdu),
+	[nvme_tcp_cmd]		= sizeof(struct nvme_tcp_cmd_pdu),
+	[nvme_tcp_h2c_data]	= sizeof(struct nvme_tcp_data_pdu),
+};
+
+static inline u8 nvmet_tcp_pdu_size(u8 type)
+{
+	size_t idx = type;
+
+	return (idx < ARRAY_SIZE(nvme_tcp_pdu_sizes) && nvme_tcp_pdu_sizes[idx]) ?
+			nvme_tcp_pdu_sizes[idx] : 0;
+}
+
+static inline bool nvmet_tcp_pdu_valid(u8 type)
+{
+	switch (type) {
+	case nvme_tcp_icreq:
+	case nvme_tcp_cmd:
+	case nvme_tcp_h2c_data:
+		/* fallthru */
+		return true;
+	}
+
+	return false;
+}
+
+static int nvmet_tcp_try_recv_pdu(struct nvmet_tcp_queue *queue)
+{
+	struct nvmet_tcp_recv_ctx *rcv = &queue->rcv;
+	struct nvme_tcp_hdr *hdr = &rcv->pdu.cmd.hdr;
+	int len;
+	struct kvec iov;
+	struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
+
+recv:
+	iov.iov_base = (void *)&rcv->pdu + rcv->offset;
+	iov.iov_len = rcv->left;
+	len = kernel_recvmsg(queue->sock, &msg, &iov, 1,
+			iov.iov_len, msg.msg_flags);
+	if (unlikely(len < 0))
+		return len;
+
+	rcv->offset += len;
+	rcv->left -= len;
+	if (rcv->left) {
+		return -EAGAIN;
+	} else if (rcv->offset == sizeof(struct nvme_tcp_hdr)) {
+		u8 hdgst = nvmet_tcp_hdgst_len(queue);
+
+		if (unlikely(!nvmet_tcp_pdu_valid(hdr->type))) {
+			pr_err("unexpected pdu type %d\n", hdr->type);
+			nvmet_tcp_fatal_error(queue);
+			return -EIO;
+		}
+
+		if (unlikely(hdr->hlen != nvmet_tcp_pdu_size(hdr->type))) {
+			pr_err("pdu type %d bad hlen %d\n", hdr->type, hdr->hlen);
+			return -EIO;
+		}
+
+		rcv->left = hdr->hlen - rcv->offset + hdgst;
+		goto recv;
+	}
+
+	if (queue->hdr_digest &&
+	    nvmet_tcp_verify_hdgst(queue, &rcv->pdu, rcv->offset)) {
+		nvmet_tcp_fatal_error(queue); /* fatal */
+		return -EPROTO;
+	}
+
+	if (queue->data_digest &&
+	    nvmet_tcp_check_ddgst(queue, &rcv->pdu)) {
+		nvmet_tcp_fatal_error(queue); /* fatal */
+		return -EPROTO;
+	}
+
+	return nvmet_tcp_done_recv_pdu(queue);
+}
+
+static void nvmet_tcp_prep_recv_ddgst(struct nvmet_tcp_cmd *cmd)
+{
+	struct nvmet_tcp_queue *queue = cmd->queue;
+
+	nvmet_tcp_ddgst(queue->rcv_hash, cmd);
+	queue->rcv.offset = 0;
+	queue->rcv.left = NVME_TCP_DIGEST_LENGTH;
+	queue->rcv.state = NVMET_TCP_RECV_DDGST;
+}
+
+static int nvmet_tcp_try_recv_data(struct nvmet_tcp_queue *queue)
+{
+	struct nvmet_tcp_cmd  *cmd = queue->rcv.cmd;
+	int ret;
+
+	while (msg_data_left(&cmd->recv_msg)) {
+		ret = sock_recvmsg(cmd->queue->sock, &cmd->recv_msg,
+			cmd->recv_msg.msg_flags);
+		if (ret <= 0)
+			return ret;
+
+		cmd->pdu_recv += ret;
+		cmd->rbytes_done += ret;
+	}
+
+	nvmet_tcp_unmap_pdu_iovec(cmd);
+
+	if (!(cmd->flags & NVMET_TCP_F_INIT_FAILED) &&
+	    cmd->rbytes_done == cmd->req.transfer_len) {
+		if (queue->data_digest) {
+			nvmet_tcp_prep_recv_ddgst(cmd);
+			return 0;
+		} else {
+			nvmet_req_execute(&cmd->req);
+		}
+	}
+
+	nvmet_prepare_receive_pdu(queue);
+	return 0;
+}
+
+static int nvmet_tcp_try_recv_ddgst(struct nvmet_tcp_queue *queue)
+{
+	struct nvmet_tcp_recv_ctx *rcv = &queue->rcv;
+	struct nvmet_tcp_cmd *cmd = rcv->cmd;
+	int ret;
+	struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
+	struct kvec iov = {
+		.iov_base = (void *)&cmd->recv_ddgst + rcv->offset,
+		.iov_len = rcv->left
+	};
+
+	ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
+			iov.iov_len, msg.msg_flags);
+	if (unlikely(ret < 0))
+		return ret;
+
+	rcv->offset += ret;
+	rcv->left -= ret;
+	if (rcv->left)
+		return -EAGAIN;
+
+	if (queue->data_digest && cmd->exp_ddgst != cmd->recv_ddgst) {
+		pr_err("queue %d: cmd %d pdu (%d) data digest error: recv %#x expected %#x\n",
+			queue->idx, cmd->req.cmd->common.command_id, rcv->pdu.cmd.hdr.type,
+			le32_to_cpu(cmd->recv_ddgst), le32_to_cpu(cmd->exp_ddgst));
+		nvmet_tcp_finish_cmd(cmd);
+		nvmet_tcp_fatal_error(queue);
+		ret = -EPROTO;
+		goto out;
+	}
+
+	if (!(cmd->flags & NVMET_TCP_F_INIT_FAILED) &&
+	    cmd->rbytes_done == cmd->req.transfer_len)
+		nvmet_req_execute(&cmd->req);
+	ret = 0;
+out:
+	nvmet_prepare_receive_pdu(queue);
+	return ret;
+}
+
+static int nvmet_tcp_try_recv_one(struct nvmet_tcp_queue *queue)
+{
+	struct nvmet_tcp_recv_ctx *rcv = &queue->rcv;
+	int result;
+
+	if (unlikely(rcv->state == NVMET_TCP_RECV_ERR))
+		return 0;
+
+	if (rcv->state == NVMET_TCP_RECV_PDU) {
+		result = nvmet_tcp_try_recv_pdu(queue);
+		if (result != 0)
+			goto done_recv;
+	}
+
+	if (rcv->state == NVMET_TCP_RECV_DATA) {
+		result = nvmet_tcp_try_recv_data(queue);
+		if (result != 0)
+			goto done_recv;
+	}
+
+	if (rcv->state == NVMET_TCP_RECV_DDGST) {
+		result = nvmet_tcp_try_recv_ddgst(queue);
+		if (result != 0)
+			goto done_recv;
+	}
+
+done_recv:
+	if (result < 0) {
+		if (result == -EAGAIN)
+			return 0;
+		return result;
+	}
+	return 1;
+}
+
+static int nvmet_tcp_try_recv(struct nvmet_tcp_queue *queue,
+		int budget, int *recvs)
+{
+	int i, ret = 0;
+
+	for (i = 0; i < budget; i++) {
+		ret = nvmet_tcp_try_recv_one(queue);
+		if (ret <= 0)
+			break;
+		(*recvs)++;
+	}
+
+	return ret;
+}
+
+static void nvmet_tcp_schedule_release_queue(struct nvmet_tcp_queue *queue)
+{
+	spin_lock(&queue->state_lock);
+	if (queue->state == NVMET_TCP_Q_DISCONNECTING)
+		goto out;
+
+	queue->state = NVMET_TCP_Q_DISCONNECTING;
+	schedule_work(&queue->release_work);
+out:
+	spin_unlock(&queue->state_lock);
+}
+
+static void nvmet_tcp_io_work(struct work_struct *w)
+{
+	struct nvmet_tcp_queue *queue =
+		container_of(w, struct nvmet_tcp_queue, io_work);
+	bool pending;
+	int ret, ops = 0;
+
+	do {
+		pending = false;
+
+		ret = nvmet_tcp_try_recv(queue, NVMET_TCP_RECV_BUDGET, &ops);
+		if (ret > 0) {
+			pending = true;
+		} else if (ret < 0) {
+			if (ret == -EPIPE || ret == -ECONNRESET)
+				kernel_sock_shutdown(queue->sock, SHUT_RDWR);
+			else
+				nvmet_tcp_fatal_error(queue);
+			return;
+		}
+
+		ret = nvmet_tcp_try_send(queue, NVMET_TCP_SEND_BUDGET, &ops);
+		if (ret > 0) {
+			/* transmitted message/data */
+			pending = true;
+		} else if (ret < 0) {
+			if (ret == -EPIPE || ret == -ECONNRESET)
+				kernel_sock_shutdown(queue->sock, SHUT_RDWR);
+			else
+				nvmet_tcp_fatal_error(queue);
+			return;
+		}
+
+	} while (pending && ops < NVMET_TCP_IO_WORK_BUDGET);
+
+	/*
+	 * We exhausted our budget, requeue ourselves
+	 */
+	if (pending)
+		queue_work_on(queue->cpu, nvmet_tcp_wq, &queue->io_work);
+}
+
+static int nvmet_tcp_alloc_cmd(struct nvmet_tcp_queue *queue,
+		struct nvmet_tcp_cmd *c)
+{
+	u8 hdgst = nvmet_tcp_hdgst_len(queue);
+
+	c->queue = queue;
+	c->req.port = queue->port->nport;
+
+	c->cmd_pdu = page_frag_alloc(&queue->pf_cache,
+			sizeof(*c->cmd_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+	if (!c->cmd_pdu)
+		return -ENOMEM;
+	c->req.cmd = &c->cmd_pdu->cmd;
+
+	c->rsp_pdu = page_frag_alloc(&queue->pf_cache,
+			sizeof(*c->rsp_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+	if (!c->rsp_pdu)
+		goto out_free_cmd;
+	c->req.rsp = &c->rsp_pdu->cqe;
+
+	c->data_pdu = page_frag_alloc(&queue->pf_cache,
+			sizeof(*c->data_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+	if (!c->data_pdu)
+		goto out_free_rsp;
+
+	c->r2t_pdu = page_frag_alloc(&queue->pf_cache,
+			sizeof(*c->r2t_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+	if (!c->r2t_pdu)
+		goto out_free_data;
+
+	c->recv_msg.msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL;
+
+	list_add_tail(&c->entry, &queue->free_list);
+
+	return 0;
+out_free_data:
+	page_frag_free(c->data_pdu);
+out_free_rsp:
+	page_frag_free(c->rsp_pdu);
+out_free_cmd:
+	page_frag_free(c->cmd_pdu);
+	return -ENOMEM;
+}
+
+static void nvmet_tcp_free_cmd(struct nvmet_tcp_cmd *c)
+{
+	page_frag_free(c->r2t_pdu);
+	page_frag_free(c->data_pdu);
+	page_frag_free(c->rsp_pdu);
+	page_frag_free(c->cmd_pdu);
+}
+
+static int nvmet_tcp_alloc_cmds(struct nvmet_tcp_queue *queue)
+{
+	struct nvmet_tcp_cmd *cmds;
+	int i, ret = -EINVAL, nr_cmds = queue->nr_cmds;
+
+	cmds = kcalloc(nr_cmds, sizeof(struct nvmet_tcp_cmd), GFP_KERNEL);
+	if (!cmds)
+		goto out;
+
+	for (i = 0; i < nr_cmds; i++) {
+		ret = nvmet_tcp_alloc_cmd(queue, cmds + i);
+		if (ret)
+			goto out_free;
+	}
+
+	queue->cmds = cmds;
+
+	return 0;
+out_free:
+	while (--i >= 0)
+		nvmet_tcp_free_cmd(cmds + i);
+	kfree(cmds);
+out:
+	return ret;
+}
+
+static void nvmet_tcp_free_cmds(struct nvmet_tcp_queue *queue)
+{
+	struct nvmet_tcp_cmd *cmds = queue->cmds;
+	int i;
+
+	for (i = 0; i < queue->nr_cmds; i++)
+		nvmet_tcp_free_cmd(cmds + i);
+
+	nvmet_tcp_free_cmd(&queue->connect);
+	kfree(cmds);
+}
+
+static void nvmet_tcp_restore_socket_callbacks(struct nvmet_tcp_queue *queue)
+{
+	struct socket *sock = queue->sock;
+
+	write_lock_bh(&sock->sk->sk_callback_lock);
+	sock->sk->sk_data_ready =  queue->old_data_ready;
+	sock->sk->sk_state_change = queue->old_state_change;
+	sock->sk->sk_write_space = queue->old_write_space;
+	sock->sk->sk_user_data = NULL;
+	write_unlock_bh(&sock->sk->sk_callback_lock);
+}
+
+static void nvmet_tcp_finish_cmd(struct nvmet_tcp_cmd *cmd)
+{
+	nvmet_req_uninit(&cmd->req);
+	nvmet_tcp_unmap_pdu_iovec(cmd);
+	sgl_free(cmd->req.sg);
+}
+
+static void nvmet_tcp_uninit_data_in_cmds(struct nvmet_tcp_queue *queue)
+{
+	struct nvmet_tcp_cmd *cmd = queue->cmds;
+	int i;
+
+	for (i = 0; i < queue->nr_cmds; i++, cmd++) {
+		if (nvmet_tcp_need_data_in(cmd))
+			nvmet_tcp_finish_cmd(cmd);
+	}
+
+	if (!queue->nr_cmds && nvmet_tcp_need_data_in(&queue->connect)) {
+		/* failed in connect */
+		nvmet_tcp_finish_cmd(&queue->connect);
+	}
+}
+
+static void nvmet_tcp_release_queue_work(struct work_struct *w)
+{
+	struct nvmet_tcp_queue *queue =
+		container_of(w, struct nvmet_tcp_queue, release_work);
+
+	mutex_lock(&nvmet_tcp_queue_mutex);
+	list_del_init(&queue->queue_list);
+	mutex_unlock(&nvmet_tcp_queue_mutex);
+
+	nvmet_tcp_restore_socket_callbacks(queue);
+	flush_work(&queue->io_work);
+
+	nvmet_tcp_uninit_data_in_cmds(queue);
+	nvmet_sq_destroy(&queue->nvme_sq);
+	cancel_work_sync(&queue->io_work);
+	sock_release(queue->sock);
+	nvmet_tcp_free_cmds(queue);
+	if (queue->hdr_digest || queue->data_digest)
+		nvmet_tcp_free_crypto(queue);
+	ida_simple_remove(&nvmet_tcp_queue_ida, queue->idx);
+
+	kfree(queue);
+}
+
+static void nvmet_tcp_data_ready(struct sock *sk)
+{
+	struct nvmet_tcp_queue *queue;
+
+	read_lock_bh(&sk->sk_callback_lock);
+	queue = sk->sk_user_data;
+	if (!queue)
+		goto out;
+
+	queue_work_on(queue->cpu, nvmet_tcp_wq, &queue->io_work);
+out:
+	read_unlock_bh(&sk->sk_callback_lock);
+}
+
+static void nvmet_tcp_write_space(struct sock *sk)
+{
+	struct nvmet_tcp_queue *queue;
+
+	read_lock_bh(&sk->sk_callback_lock);
+	queue = sk->sk_user_data;
+	if (!queue)
+		goto out;
+
+	if (unlikely(queue->state == NVMET_TCP_Q_CONNECTING)) {
+		queue->old_write_space(sk);
+		goto out;
+	}
+
+	if (sk_stream_is_writeable(sk)) {
+		clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+		queue_work_on(queue->cpu, nvmet_tcp_wq, &queue->io_work);
+	}
+out:
+	read_unlock_bh(&sk->sk_callback_lock);
+}
+
+static void nvmet_tcp_state_change(struct sock *sk)
+{
+	struct nvmet_tcp_queue *queue;
+
+	write_lock_bh(&sk->sk_callback_lock);
+	queue = sk->sk_user_data;
+	if (!queue)
+		goto done;
+
+	switch (sk->sk_state) {
+	case TCP_FIN_WAIT1:
+	case TCP_CLOSE_WAIT:
+	case TCP_CLOSE:
+		/* FALLTHRU */
+		sk->sk_user_data = NULL;
+		nvmet_tcp_schedule_release_queue(queue);
+		break;
+	default:
+		pr_warn("queue %d unhandled state %d\n", queue->idx, sk->sk_state);
+	}
+done:
+	write_unlock_bh(&sk->sk_callback_lock);
+}
+
+static int nvmet_tcp_set_queue_sock(struct nvmet_tcp_queue *queue)
+{
+	struct socket *sock = queue->sock;
+	struct linger sol = { .l_onoff = 1, .l_linger = 0 };
+	int ret;
+
+	ret = kernel_getsockname(sock,
+		(struct sockaddr *)&queue->sockaddr);
+	if (ret < 0)
+		return ret;
+
+	ret = kernel_getpeername(sock,
+		(struct sockaddr *)&queue->sockaddr_peer);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * Cleanup whatever is sitting in the TCP transmit queue on socket
+	 * close. This is done to prevent stale data from being sent should
+	 * the network connection be restored before TCP times out.
+	 */
+	ret = kernel_setsockopt(sock, SOL_SOCKET, SO_LINGER,
+			(char *)&sol, sizeof(sol));
+	if (ret)
+		return ret;
+
+	write_lock_bh(&sock->sk->sk_callback_lock);
+	sock->sk->sk_user_data = queue;
+	queue->old_data_ready = sock->sk->sk_data_ready;
+	sock->sk->sk_data_ready = nvmet_tcp_data_ready;
+	queue->old_state_change = sock->sk->sk_state_change;
+	sock->sk->sk_state_change = nvmet_tcp_state_change;
+	queue->old_write_space = sock->sk->sk_write_space;
+	sock->sk->sk_write_space = nvmet_tcp_write_space;
+	write_unlock_bh(&sock->sk->sk_callback_lock);
+
+	return 0;
+}
+
+static int nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
+		struct socket *newsock)
+{
+	struct nvmet_tcp_queue *queue;
+	int ret;
+
+	queue = kzalloc(sizeof(*queue), GFP_KERNEL);
+	if (!queue)
+		return -ENOMEM;
+
+	INIT_WORK(&queue->release_work, nvmet_tcp_release_queue_work);
+	INIT_WORK(&queue->io_work, nvmet_tcp_io_work);
+	queue->sock = newsock;
+	queue->port = port;
+	queue->nr_cmds = 0;
+	spin_lock_init(&queue->state_lock);
+	queue->state = NVMET_TCP_Q_CONNECTING;
+	INIT_LIST_HEAD(&queue->free_list);
+	init_llist_head(&queue->resp_list);
+	INIT_LIST_HEAD(&queue->resp_send_list);
+
+	queue->idx = ida_simple_get(&nvmet_tcp_queue_ida, 0, 0, GFP_KERNEL);
+	if (queue->idx < 0) {
+		ret = queue->idx;
+		goto out_free_queue;
+	}
+
+	ret = nvmet_tcp_alloc_cmd(queue, &queue->connect);
+	if (ret)
+		goto out_ida_remove;
+
+	ret = nvmet_sq_init(&queue->nvme_sq);
+	if (ret)
+		goto out_ida_remove;
+
+	port->last_cpu = cpumask_next_wrap(port->last_cpu,
+				cpu_online_mask, -1, false);
+	queue->cpu = port->last_cpu;
+	nvmet_prepare_receive_pdu(queue);
+
+	mutex_lock(&nvmet_tcp_queue_mutex);
+	list_add_tail(&queue->queue_list, &nvmet_tcp_queue_list);
+	mutex_unlock(&nvmet_tcp_queue_mutex);
+
+	ret = nvmet_tcp_set_queue_sock(queue);
+	if (ret)
+		goto out_destroy_sq;
+
+	queue_work_on(queue->cpu, nvmet_tcp_wq, &queue->io_work);
+
+	return 0;
+out_destroy_sq:
+	mutex_lock(&nvmet_tcp_queue_mutex);
+	list_del_init(&queue->queue_list);
+	mutex_unlock(&nvmet_tcp_queue_mutex);
+	nvmet_sq_destroy(&queue->nvme_sq);
+out_ida_remove:
+	ida_simple_remove(&nvmet_tcp_queue_ida, queue->idx);
+out_free_queue:
+	kfree(queue);
+	return ret;
+}
+
+static void nvmet_tcp_accept_work(struct work_struct *w)
+{
+	struct nvmet_tcp_port *port =
+		container_of(w, struct nvmet_tcp_port, accept_work);
+	struct socket *newsock;
+	int ret;
+
+	while (true) {
+		ret = kernel_accept(port->sock, &newsock, O_NONBLOCK);
+		if (ret < 0) {
+			if (ret != -EAGAIN)
+				pr_warn("failed to accept err=%d\n", ret);
+			return;
+		}
+		ret = nvmet_tcp_alloc_queue(port, newsock);
+		if (ret) {
+			pr_err("failed to allocate queue\n");
+			sock_release(newsock);
+		}
+	}
+}
+
+static void nvmet_tcp_listen_data_ready(struct sock *sk)
+{
+	struct nvmet_tcp_port *port;
+
+	read_lock_bh(&sk->sk_callback_lock);
+	port = sk->sk_user_data;
+	if (!port)
+		goto out;
+
+	if (sk->sk_state == TCP_LISTEN)
+		schedule_work(&port->accept_work);
+out:
+	read_unlock_bh(&sk->sk_callback_lock);
+}
+
+static int nvmet_tcp_add_port(struct nvmet_port *nport)
+{
+	struct nvmet_tcp_port *port;
+	__kernel_sa_family_t af;
+	int opt, ret;
+
+	port = kzalloc(sizeof(*port), GFP_KERNEL);
+	if (!port)
+		return -ENOMEM;
+
+	switch (nport->disc_addr.adrfam) {
+	case NVMF_ADDR_FAMILY_IP4:
+		af = AF_INET;
+		break;
+	case NVMF_ADDR_FAMILY_IP6:
+		af = AF_INET6;
+		break;
+	default:
+		pr_err("address family %d not supported\n",
+				nport->disc_addr.adrfam);
+		ret = -EINVAL;
+		goto err_port;
+	}
+
+	ret = inet_pton_with_scope(&init_net, af, nport->disc_addr.traddr,
+			nport->disc_addr.trsvcid, &port->addr);
+	if (ret) {
+		pr_err("malformed ip/port passed: %s:%s\n",
+			nport->disc_addr.traddr, nport->disc_addr.trsvcid);
+		goto err_port;
+	}
+
+	port->nport = nport;
+	port->last_cpu = -1;
+	INIT_WORK(&port->accept_work, nvmet_tcp_accept_work);
+	if (port->nport->inline_data_size < 0)
+		port->nport->inline_data_size = NVMET_TCP_DEF_INLINE_DATA_SIZE;
+
+	ret = sock_create(port->addr.ss_family, SOCK_STREAM,
+				IPPROTO_TCP, &port->sock);
+	if (ret) {
+		pr_err("failed to create a socket\n");
+		goto err_port;
+	}
+
+	port->sock->sk->sk_user_data = port;
+	port->old_data_ready = port->sock->sk->sk_data_ready;
+	port->sock->sk->sk_data_ready = nvmet_tcp_listen_data_ready;
+
+	opt = 1;
+	ret = kernel_setsockopt(port->sock, IPPROTO_TCP,
+			TCP_NODELAY, (char *)&opt, sizeof(opt));
+	if (ret) {
+		pr_err("failed to set TCP_NODELAY sock opt %d\n", ret);
+		goto err_sock;
+	}
+
+	ret = kernel_setsockopt(port->sock, SOL_SOCKET, SO_REUSEADDR,
+			(char *)&opt, sizeof(opt));
+	if (ret) {
+		pr_err("failed to set SO_REUSEADDR sock opt %d\n", ret);
+		goto err_sock;
+	}
+
+	ret = kernel_bind(port->sock, (struct sockaddr *)&port->addr,
+			sizeof(port->addr));
+	if (ret) {
+		pr_err("failed to bind port socket %d\n", ret);
+		goto err_sock;
+	}
+
+	ret = kernel_listen(port->sock, 128);
+	if (ret) {
+		pr_err("failed to listen %d on port sock\n", ret);
+		goto err_sock;
+	}
+
+	nport->priv = port;
+	pr_info("enabling port %d (%pISpc)\n",
+		le16_to_cpu(nport->disc_addr.portid), &port->addr);
+
+	return 0;
+
+err_sock:
+	sock_release(port->sock);
+err_port:
+	kfree(port);
+	return ret;
+}
+
+static void nvmet_tcp_remove_port(struct nvmet_port *nport)
+{
+	struct nvmet_tcp_port *port = nport->priv;
+
+	write_lock_bh(&port->sock->sk->sk_callback_lock);
+	port->sock->sk->sk_data_ready = port->old_data_ready;
+	port->sock->sk->sk_user_data = NULL;
+	write_unlock_bh(&port->sock->sk->sk_callback_lock);
+	cancel_work_sync(&port->accept_work);
+
+	sock_release(port->sock);
+	kfree(port);
+}
+
+static void nvmet_tcp_delete_ctrl(struct nvmet_ctrl *ctrl)
+{
+	struct nvmet_tcp_queue *queue;
+
+	mutex_lock(&nvmet_tcp_queue_mutex);
+	list_for_each_entry(queue, &nvmet_tcp_queue_list, queue_list)
+		if (queue->nvme_sq.ctrl == ctrl)
+			kernel_sock_shutdown(queue->sock, SHUT_RDWR);
+	mutex_unlock(&nvmet_tcp_queue_mutex);
+}
+
+static u16 nvmet_tcp_install_queue(struct nvmet_sq *sq)
+{
+	struct nvmet_tcp_queue *queue =
+		container_of(sq, struct nvmet_tcp_queue, nvme_sq);
+	int ret;
+
+	if (sq->qid == 0) {
+		/* Let inflight controller teardown complete */
+		flush_scheduled_work();
+	}
+
+	queue->nr_cmds = sq->size * 2;
+	if (nvmet_tcp_alloc_cmds(queue))
+		return NVME_SC_INTERNAL;
+	return 0;
+}
+
+static void nvmet_tcp_disc_port_addr(struct nvmet_req *req,
+		struct nvmet_port *nport, char *traddr)
+{
+	struct nvmet_tcp_port *port = nport->priv;
+
+	if (inet_addr_is_any((struct sockaddr *)&port->addr)) {
+		struct nvmet_tcp_cmd *cmd =
+			container_of(req, struct nvmet_tcp_cmd, req);
+		struct nvmet_tcp_queue *queue = cmd->queue;
+
+		sprintf(traddr, "%pISc", (struct sockaddr *)&queue->sockaddr);
+	} else {
+		memcpy(traddr, nport->disc_addr.traddr, NVMF_TRADDR_SIZE);
+	}
+}
+
+static struct nvmet_fabrics_ops nvmet_tcp_ops = {
+	.owner			= THIS_MODULE,
+	.type			= NVMF_TRTYPE_TCP,
+	.msdbd			= 1,
+	.has_keyed_sgls		= 0,
+	.add_port		= nvmet_tcp_add_port,
+	.remove_port		= nvmet_tcp_remove_port,
+	.queue_response		= nvmet_tcp_queue_response,
+	.delete_ctrl		= nvmet_tcp_delete_ctrl,
+	.install_queue		= nvmet_tcp_install_queue,
+	.disc_traddr		= nvmet_tcp_disc_port_addr,
+};
+
+static int __init nvmet_tcp_init(void)
+{
+	int ret;
+
+	nvmet_tcp_wq = alloc_workqueue("nvmet_tcp_wq", WQ_HIGHPRI, 0);
+	if (!nvmet_tcp_wq)
+		return -ENOMEM;
+
+	ret = nvmet_register_transport(&nvmet_tcp_ops);
+	if (ret)
+		goto err;
+
+	return 0;
+err:
+	destroy_workqueue(nvmet_tcp_wq);
+	return ret;
+}
+
+static void __exit nvmet_tcp_exit(void)
+{
+	struct nvmet_tcp_queue *queue;
+
+	nvmet_unregister_transport(&nvmet_tcp_ops);
+
+	flush_scheduled_work();
+	mutex_lock(&nvmet_tcp_queue_mutex);
+	list_for_each_entry(queue, &nvmet_tcp_queue_list, queue_list)
+		kernel_sock_shutdown(queue->sock, SHUT_RDWR);
+	mutex_unlock(&nvmet_tcp_queue_mutex);
+	flush_scheduled_work();
+
+	destroy_workqueue(nvmet_tcp_wq);
+}
+
+module_init(nvmet_tcp_init);
+module_exit(nvmet_tcp_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS("nvmet-transport-3"); /* 3 == NVMF_TRTYPE_TCP */
diff --git a/include/linux/nvme-tcp.h b/include/linux/nvme-tcp.h
index 33c8afaf63bd..685780d1ed04 100644
--- a/include/linux/nvme-tcp.h
+++ b/include/linux/nvme-tcp.h
@@ -11,6 +11,7 @@
 
 #define NVME_TCP_DISC_PORT	8009
 #define NVME_TCP_ADMIN_CCSZ	SZ_8K
+#define NVME_TCP_DIGEST_LENGTH	4
 
 enum nvme_tcp_pfv {
 	NVME_TCP_PFV_1_0 = 0x0,
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v2 14/14] nvme-tcp: add NVMe over TCP host driver
  2018-11-20  3:00 ` Sagi Grimberg
@ 2018-11-20  3:00   ` Sagi Grimberg
  -1 siblings, 0 replies; 76+ messages in thread
From: Sagi Grimberg @ 2018-11-20  3:00 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, netdev, David S. Miller, Keith Busch, Christoph Hellwig

From: Sagi Grimberg <sagi@lightbitslabs.com>

This patch implements the NVMe over TCP host driver. It can be used to
connect to remote NVMe over Fabrics subsystems over good old TCP/IP.

The driver implements TP 8000, which defines how NVMe over Fabrics
capsules and data are encapsulated in NVMe/TCP PDUs and exchanged on top
of a TCP byte stream. NVMe/TCP header and data digests are supported as
well.

To connect to all NVMe over Fabrics controllers reachable on a given
target port over TCP, use the following command:

	nvme connect-all -t tcp -a $IPADDR

This requires the latest version of nvme-cli with TCP support.
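
To connect to a single subsystem instead, standard nvme-cli syntax
applies; for example ($IPADDR, $PORT and $SUBSYS_NQN are placeholders for
the target address, transport service id and subsystem NQN):

	nvme connect -t tcp -a $IPADDR -s $PORT -n $SUBSYS_NQN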

Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
Signed-off-by: Solganik Alexander <sashas@lightbitslabs.com>
---
 drivers/nvme/host/Kconfig  |   15 +
 drivers/nvme/host/Makefile |    3 +
 drivers/nvme/host/tcp.c    | 2306 ++++++++++++++++++++++++++++++++++++
 3 files changed, 2324 insertions(+)
 create mode 100644 drivers/nvme/host/tcp.c

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index 88a8b5916624..0f345e207675 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -57,3 +57,18 @@ config NVME_FC
 	  from https://github.com/linux-nvme/nvme-cli.
 
 	  If unsure, say N.
+
+config NVME_TCP
+	tristate "NVM Express over Fabrics TCP host driver"
+	depends on INET
+	depends on BLK_DEV_NVME
+	select NVME_FABRICS
+	help
+	  This provides support for the NVMe over Fabrics protocol using
+	  the TCP transport.  This allows you to use remote block devices
+	  exported using the NVMe protocol set.
+
+	  To configure a NVMe over Fabrics controller use the nvme-cli tool
+	  from https://github.com/linux-nvme/nvme-cli.
+
+	  If unsure, say N.
diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
index aea459c65ae1..8a4b671c5f0c 100644
--- a/drivers/nvme/host/Makefile
+++ b/drivers/nvme/host/Makefile
@@ -7,6 +7,7 @@ obj-$(CONFIG_BLK_DEV_NVME)		+= nvme.o
 obj-$(CONFIG_NVME_FABRICS)		+= nvme-fabrics.o
 obj-$(CONFIG_NVME_RDMA)			+= nvme-rdma.o
 obj-$(CONFIG_NVME_FC)			+= nvme-fc.o
+obj-$(CONFIG_NVME_TCP)			+= nvme-tcp.o
 
 nvme-core-y				:= core.o
 nvme-core-$(CONFIG_TRACING)		+= trace.o
@@ -21,3 +22,5 @@ nvme-fabrics-y				+= fabrics.o
 nvme-rdma-y				+= rdma.o
 
 nvme-fc-y				+= fc.o
+
+nvme-tcp-y				+= tcp.o
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
new file mode 100644
index 000000000000..4c583859f0ad
--- /dev/null
+++ b/drivers/nvme/host/tcp.c
@@ -0,0 +1,2306 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * NVMe over Fabrics TCP host.
+ * Copyright (c) 2018 LightBits Labs. All rights reserved.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/nvme-tcp.h>
+#include <net/sock.h>
+#include <net/tcp.h>
+#include <linux/blk-mq.h>
+#include <crypto/hash.h>
+
+#include "nvme.h"
+#include "fabrics.h"
+
+struct nvme_tcp_queue;
+
+enum nvme_tcp_send_state {
+	NVME_TCP_SEND_CMD_PDU = 0,
+	NVME_TCP_SEND_H2C_PDU,
+	NVME_TCP_SEND_DATA,
+	NVME_TCP_SEND_DDGST,
+};
+
+struct nvme_tcp_send_ctx {
+	struct bio		*curr_bio;
+	struct iov_iter		iter;
+	size_t			offset;
+	size_t			data_sent;
+	enum nvme_tcp_send_state state;
+};
+
+struct nvme_tcp_recv_ctx {
+	struct iov_iter		iter;
+	struct bio		*curr_bio;
+};
+
+struct nvme_tcp_request {
+	struct nvme_request	req;
+	void			*pdu;
+	struct nvme_tcp_queue	*queue;
+	u32			data_len;
+	u32			pdu_len;
+	u32			pdu_sent;
+	u16			ttag;
+	struct list_head	entry;
+	struct nvme_tcp_recv_ctx rcv;
+	struct nvme_tcp_send_ctx snd;
+	u32			ddgst;
+};
+
+enum nvme_tcp_queue_flags {
+	NVME_TCP_Q_ALLOCATED	= 0,
+	NVME_TCP_Q_LIVE		= 1,
+};
+
+enum nvme_tcp_recv_state {
+	NVME_TCP_RECV_PDU = 0,
+	NVME_TCP_RECV_DATA,
+	NVME_TCP_RECV_DDGST,
+};
+
+struct nvme_tcp_queue_recv_ctx {
+	char		*pdu;
+	int		pdu_remaining;
+	int		pdu_offset;
+	size_t		data_remaining;
+	size_t		ddgst_remaining;
+};
+
+struct nvme_tcp_ctrl;
+struct nvme_tcp_queue {
+	struct socket		*sock;
+	struct work_struct	io_work;
+	int			io_cpu;
+
+	spinlock_t		lock;
+	struct list_head	send_list;
+
+	int			queue_size;
+	size_t			cmnd_capsule_len;
+	struct nvme_tcp_ctrl	*ctrl;
+	unsigned long		flags;
+	bool			rd_enabled;
+
+	struct nvme_tcp_queue_recv_ctx rcv;
+	struct nvme_tcp_request *request;
+
+	bool			hdr_digest;
+	bool			data_digest;
+	struct ahash_request	*rcv_hash;
+	struct ahash_request	*snd_hash;
+	__le32			exp_ddgst;
+	__le32			recv_ddgst;
+
+	struct page_frag_cache	pf_cache;
+
+	void (*sc)(struct sock *);
+	void (*dr)(struct sock *);
+	void (*ws)(struct sock *);
+};
+
+struct nvme_tcp_ctrl {
+	/* read only in the hot path */
+	struct nvme_tcp_queue	*queues;
+	struct blk_mq_tag_set	tag_set;
+
+	/* other member variables */
+	struct list_head	list;
+	struct blk_mq_tag_set	admin_tag_set;
+	struct sockaddr_storage addr;
+	struct sockaddr_storage src_addr;
+	struct nvme_ctrl	ctrl;
+
+	struct nvme_tcp_request async_req;
+};
+
+static LIST_HEAD(nvme_tcp_ctrl_list);
+static DEFINE_MUTEX(nvme_tcp_ctrl_mutex);
+static struct workqueue_struct *nvme_tcp_wq;
+static struct blk_mq_ops nvme_tcp_mq_ops;
+static struct blk_mq_ops nvme_tcp_admin_mq_ops;
+
+static inline struct nvme_tcp_ctrl *to_tcp_ctrl(struct nvme_ctrl *ctrl)
+{
+	return container_of(ctrl, struct nvme_tcp_ctrl, ctrl);
+}
+
+static inline int nvme_tcp_queue_id(struct nvme_tcp_queue *queue)
+{
+	return queue - queue->ctrl->queues;
+}
+
+static inline struct blk_mq_tags *nvme_tcp_tagset(struct nvme_tcp_queue *queue)
+{
+	u32 queue_idx = nvme_tcp_queue_id(queue);
+
+	if (queue_idx == 0)
+		return queue->ctrl->admin_tag_set.tags[queue_idx];
+	return queue->ctrl->tag_set.tags[queue_idx - 1];
+}
+
+static inline u8 nvme_tcp_hdgst_len(struct nvme_tcp_queue *queue)
+{
+	return queue->hdr_digest ? NVME_TCP_DIGEST_LENGTH : 0;
+}
+
+static inline u8 nvme_tcp_ddgst_len(struct nvme_tcp_queue *queue)
+{
+	return queue->data_digest ? NVME_TCP_DIGEST_LENGTH : 0;
+}
+
+static inline size_t nvme_tcp_inline_data_size(struct nvme_tcp_queue *queue)
+{
+	return queue->cmnd_capsule_len - sizeof(struct nvme_command);
+}
+
+static inline bool nvme_tcp_async_req(struct nvme_tcp_request *req)
+{
+	return req == &req->queue->ctrl->async_req;
+}
+
+static inline bool nvme_tcp_has_inline_data(struct nvme_tcp_request *req)
+{
+	struct request *rq;
+	unsigned int bytes;
+
+	if (unlikely(nvme_tcp_async_req(req)))
+		return false; /* async events don't have a request */
+
+	rq = blk_mq_rq_from_pdu(req);
+	bytes = blk_rq_payload_bytes(rq);
+
+	return rq_data_dir(rq) == WRITE && bytes &&
+		bytes <= nvme_tcp_inline_data_size(req->queue);
+}
+
+static inline struct page *nvme_tcp_req_cur_page(struct nvme_tcp_request *req)
+{
+	return req->snd.iter.bvec->bv_page;
+}
+
+static inline size_t nvme_tcp_req_cur_offset(struct nvme_tcp_request *req)
+{
+	return req->snd.iter.bvec->bv_offset + req->snd.iter.iov_offset;
+}
+
+static inline size_t nvme_tcp_req_cur_length(struct nvme_tcp_request *req)
+{
+	return min_t(size_t, req->snd.iter.bvec->bv_len - req->snd.iter.iov_offset,
+			req->pdu_len - req->pdu_sent);
+}
+
+static inline size_t nvme_tcp_req_offset(struct nvme_tcp_request *req)
+{
+	return req->snd.iter.iov_offset;
+}
+
+static inline size_t nvme_tcp_pdu_data_left(struct nvme_tcp_request *req)
+{
+	return rq_data_dir(blk_mq_rq_from_pdu(req)) == WRITE ?
+			req->pdu_len - req->pdu_sent : 0;
+}
+
+static inline size_t nvme_tcp_pdu_last_send(struct nvme_tcp_request *req,
+		int len)
+{
+	return nvme_tcp_pdu_data_left(req) <= len;
+}
+
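+/*
+ * Set up the send iov_iter over the request payload: either the special
+ * payload bvec or the bvecs of the current bio.
+ */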
+static void nvme_tcp_init_send_iter(struct nvme_tcp_request *req)
+{
+	struct request *rq = blk_mq_rq_from_pdu(req);
+	struct bio_vec *vec;
+	unsigned int size;
+	int nsegs;
+	size_t offset;
+
+	if (rq->rq_flags & RQF_SPECIAL_PAYLOAD) {
+		vec = &rq->special_vec;
+		nsegs = 1;
+		size = blk_rq_payload_bytes(rq);
+		offset = 0;
+	} else {
+		struct bio *bio = req->snd.curr_bio;
+
+		vec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+		nsegs = bio_segments(bio);
+		size = bio->bi_iter.bi_size;
+		offset = bio->bi_iter.bi_bvec_done;
+	}
+
+	iov_iter_bvec(&req->snd.iter, WRITE, vec, nsegs, size);
+	req->snd.iter.iov_offset = offset;
+}
+
+static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
+		int len)
+{
+	req->snd.data_sent += len;
+	req->pdu_sent += len;
+	iov_iter_advance(&req->snd.iter, len);
+	if (!iov_iter_count(&req->snd.iter) &&
+	    req->snd.data_sent < req->data_len) {
+		req->snd.curr_bio = req->snd.curr_bio->bi_next;
+		nvme_tcp_init_send_iter(req);
+	}
+}
+
+static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req)
+{
+	struct nvme_tcp_queue *queue = req->queue;
+
+	spin_lock_bh(&queue->lock);
+	list_add_tail(&req->entry, &queue->send_list);
+	spin_unlock_bh(&queue->lock);
+
+	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+}
+
+static inline struct nvme_tcp_request *
+nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
+{
+	struct nvme_tcp_request *req;
+
+	spin_lock_bh(&queue->lock);
+	req = list_first_entry_or_null(&queue->send_list,
+			struct nvme_tcp_request, entry);
+	if (req)
+		list_del(&req->entry);
+	spin_unlock_bh(&queue->lock);
+
+	return req;
+}
+
+static inline void nvme_tcp_ddgst_final(struct ahash_request *hash, u32 *dgst)
+{
+	ahash_request_set_crypt(hash, NULL, (u8 *)dgst, 0);
+	crypto_ahash_final(hash);
+}
+
+static inline void nvme_tcp_ddgst_update(struct ahash_request *hash,
+		struct page *page, off_t off, size_t len)
+{
+	struct scatterlist sg;
+
+	sg_init_marker(&sg, 1);
+	sg_set_page(&sg, page, len, off);
+	ahash_request_set_crypt(hash, &sg, NULL, len);
+	crypto_ahash_update(hash);
+}
+
+static inline void nvme_tcp_hdgst(struct ahash_request *hash,
+		void *pdu, size_t len)
+{
+	struct scatterlist sg;
+
+	sg_init_one(&sg, pdu, len);
+	ahash_request_set_crypt(hash, &sg, pdu + len, len);
+	crypto_ahash_digest(hash);
+}
+
+static int nvme_tcp_verify_hdgst(struct nvme_tcp_queue *queue,
+	void *pdu, size_t pdu_len)
+{
+	struct nvme_tcp_hdr *hdr = pdu;
+	__le32 recv_digest;
+	__le32 exp_digest;
+
+	if (unlikely(!(hdr->flags & NVME_TCP_F_HDGST))) {
+		dev_err(queue->ctrl->ctrl.device,
+			"queue %d: header digest flag is cleared\n",
+			nvme_tcp_queue_id(queue));
+		return -EPROTO;
+	}
+
+	recv_digest = *(__le32 *)(pdu + hdr->hlen);
+	nvme_tcp_hdgst(queue->rcv_hash, pdu, pdu_len);
+	exp_digest = *(__le32 *)(pdu + hdr->hlen);
+	if (recv_digest != exp_digest) {
+		dev_err(queue->ctrl->ctrl.device,
+			"header digest error: recv %#x expected %#x\n",
+			le32_to_cpu(recv_digest), le32_to_cpu(exp_digest));
+		return -EIO;
+	}
+
+	return 0;
+}
+
+static int nvme_tcp_check_ddgst(struct nvme_tcp_queue *queue, void *pdu)
+{
+	struct nvme_tcp_hdr *hdr = pdu;
+	u32 len;
+
+	len = le32_to_cpu(hdr->plen) - hdr->hlen -
+		((hdr->flags & NVME_TCP_F_HDGST) ? nvme_tcp_hdgst_len(queue) : 0);
+
+	if (unlikely(len && !(hdr->flags & NVME_TCP_F_DDGST))) {
+		dev_err(queue->ctrl->ctrl.device,
+			"queue %d: data digest flag is cleared\n",
+			nvme_tcp_queue_id(queue));
+		return -EPROTO;
+	}
+	crypto_ahash_init(queue->rcv_hash);
+
+	return 0;
+}
+
+static void nvme_tcp_exit_request(struct blk_mq_tag_set *set,
+		struct request *rq, unsigned int hctx_idx)
+{
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+
+	page_frag_free(req->pdu);
+}
+
+static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
+		struct request *rq, unsigned int hctx_idx,
+		unsigned int numa_node)
+{
+	struct nvme_tcp_ctrl *ctrl = set->driver_data;
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	int queue_idx = (set == &ctrl->tag_set) ? hctx_idx + 1 : 0;
+	struct nvme_tcp_queue *queue = &ctrl->queues[queue_idx];
+	u8 hdgst = nvme_tcp_hdgst_len(queue);
+
+	req->pdu = page_frag_alloc(&queue->pf_cache,
+		sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
+		GFP_KERNEL | __GFP_ZERO);
+	if (!req->pdu)
+		return -ENOMEM;
+
+	req->queue = queue;
+	nvme_req(rq)->ctrl = &ctrl->ctrl;
+
+	return 0;
+}
+
+static int nvme_tcp_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+		unsigned int hctx_idx)
+{
+	struct nvme_tcp_ctrl *ctrl = data;
+	struct nvme_tcp_queue *queue = &ctrl->queues[hctx_idx + 1];
+
+	BUG_ON(hctx_idx >= ctrl->ctrl.queue_count);
+
+	hctx->driver_data = queue;
+	return 0;
+}
+
+static int nvme_tcp_init_admin_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+		unsigned int hctx_idx)
+{
+	struct nvme_tcp_ctrl *ctrl = data;
+	struct nvme_tcp_queue *queue = &ctrl->queues[0];
+
+	BUG_ON(hctx_idx != 0);
+
+	hctx->driver_data = queue;
+	return 0;
+}
+
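+/*
+ * The receive state is implied by which part of the current PDU is still
+ * outstanding: PDU header bytes first, then the data digest, otherwise
+ * c2h data.
+ */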
+static enum nvme_tcp_recv_state nvme_tcp_recv_state(struct nvme_tcp_queue *queue)
+{
+	return  (queue->rcv.pdu_remaining) ? NVME_TCP_RECV_PDU :
+		(queue->rcv.ddgst_remaining) ? NVME_TCP_RECV_DDGST :
+		NVME_TCP_RECV_DATA;
+}
+
+static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
+{
+	struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+
+	rcv->pdu_remaining = sizeof(struct nvme_tcp_rsp_pdu) +
+				nvme_tcp_hdgst_len(queue);
+	rcv->pdu_offset = 0;
+	rcv->data_remaining = -1;
+	rcv->ddgst_remaining = 0;
+}
+
+static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
+{
+	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+		return;
+
+	queue_work(nvme_wq, &ctrl->err_work);
+}
+
+static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
+		struct nvme_completion *cqe)
+{
+	struct request *rq;
+	struct nvme_tcp_request *req;
+
+	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
+	if (!rq) {
+		dev_err(queue->ctrl->ctrl.device,
+			"queue %d tag 0x%x not found\n",
+			nvme_tcp_queue_id(queue), cqe->command_id);
+		nvme_tcp_error_recovery(&queue->ctrl->ctrl);
+		return -EINVAL;
+	}
+	req = blk_mq_rq_to_pdu(rq);
+
+	nvme_end_request(rq, cqe->status, cqe->result);
+
+	return 0;
+}
+
+static int nvme_tcp_handle_c2h_data(struct nvme_tcp_queue *queue,
+		struct nvme_tcp_data_pdu *pdu)
+{
+	struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+	struct nvme_tcp_request *req;
+	struct request *rq;
+
+	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
+	if (!rq) {
+		dev_err(queue->ctrl->ctrl.device,
+			"queue %d tag %#x not found\n",
+			nvme_tcp_queue_id(queue), pdu->command_id);
+		return -ENOENT;
+	}
+	req = blk_mq_rq_to_pdu(rq);
+
+	if (!blk_rq_payload_bytes(rq)) {
+		dev_err(queue->ctrl->ctrl.device,
+			"queue %d tag %#x unexpected data\n",
+			nvme_tcp_queue_id(queue), rq->tag);
+		return -EIO;
+	}
+
+	rcv->data_remaining = le32_to_cpu(pdu->data_length);
+	/* No support for out-of-order */
+	WARN_ON(le32_to_cpu(pdu->data_offset));
+
+	return 0;
+
+}
+
+static int nvme_tcp_handle_comp(struct nvme_tcp_queue *queue,
+		struct nvme_tcp_rsp_pdu *pdu)
+{
+	struct nvme_completion *cqe = &pdu->cqe;
+	int ret = 0;
+
+	/*
+	 * AEN requests are special as they don't time out and can
+	 * survive any kind of queue freeze and often don't respond to
+	 * aborts.  We don't even bother to allocate a struct request
+	 * for them but rather special case them here.
+	 */
+	if (unlikely(nvme_tcp_queue_id(queue) == 0 &&
+	    cqe->command_id >= NVME_AQ_BLK_MQ_DEPTH))
+		nvme_complete_async_event(&queue->ctrl->ctrl, cqe->status,
+				&cqe->result);
+	else
+		ret = nvme_tcp_process_nvme_cqe(queue, cqe);
+
+	return ret;
+}
+
+static int nvme_tcp_setup_h2c_data_pdu(struct nvme_tcp_request *req,
+		struct nvme_tcp_r2t_pdu *pdu)
+{
+	struct nvme_tcp_data_pdu *data = req->pdu;
+	struct nvme_tcp_queue *queue = req->queue;
+	struct request *rq = blk_mq_rq_from_pdu(req);
+	u8 hdgst = nvme_tcp_hdgst_len(queue);
+	u8 ddgst = nvme_tcp_ddgst_len(queue);
+
+	req->pdu_len = le32_to_cpu(pdu->r2t_length);
+	req->pdu_sent = 0;
+
+	if (unlikely(req->snd.data_sent + req->pdu_len > req->data_len)) {
+		dev_err(queue->ctrl->ctrl.device,
+			"req %d r2t length %u exceeded data length %u (%zu sent)\n",
+			rq->tag, req->pdu_len, req->data_len,
+			req->snd.data_sent);
+		return -EPROTO;
+	}
+
+	if (unlikely(le32_to_cpu(pdu->r2t_offset) < req->snd.data_sent)) {
+		dev_err(queue->ctrl->ctrl.device,
+			"req %d unexpected r2t offset %u (expected %zu)\n",
+			rq->tag, le32_to_cpu(pdu->r2t_offset),
+			req->snd.data_sent);
+		return -EPROTO;
+	}
+
+	memset(data, 0, sizeof(*data));
+	data->hdr.type = nvme_tcp_h2c_data;
+	data->hdr.flags = NVME_TCP_F_DATA_LAST;
+	if (queue->hdr_digest)
+		data->hdr.flags |= NVME_TCP_F_HDGST;
+	if (queue->data_digest)
+		data->hdr.flags |= NVME_TCP_F_DDGST;
+	data->hdr.hlen = sizeof(*data);
+	data->hdr.pdo = data->hdr.hlen + hdgst;
+	data->hdr.plen =
+		cpu_to_le32(data->hdr.hlen + hdgst + req->pdu_len + ddgst);
+	data->ttag = pdu->ttag;
+	data->command_id = rq->tag;
+	data->data_offset = cpu_to_le32(req->snd.data_sent);
+	data->data_length = cpu_to_le32(req->pdu_len);
+	return 0;
+}
+
+static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
+		struct nvme_tcp_r2t_pdu *pdu)
+{
+	struct nvme_tcp_request *req;
+	struct request *rq;
+	int ret;
+
+	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
+	if (!rq) {
+		dev_err(queue->ctrl->ctrl.device,
+			"queue %d tag %#x not found\n",
+			nvme_tcp_queue_id(queue), pdu->command_id);
+		return -ENOENT;
+	}
+	req = blk_mq_rq_to_pdu(rq);
+
+	ret = nvme_tcp_setup_h2c_data_pdu(req, pdu);
+	if (unlikely(ret))
+		return ret;
+
+	req->snd.state = NVME_TCP_SEND_H2C_PDU;
+	req->snd.offset = 0;
+
+	nvme_tcp_queue_request(req);
+
+	return 0;
+}
+
+static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
+		unsigned int *offset, size_t *len)
+{
+	struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+	struct nvme_tcp_hdr *hdr;
+	size_t rcv_len = min_t(size_t, *len, rcv->pdu_remaining);
+	int ret;
+
+	ret = skb_copy_bits(skb, *offset, &rcv->pdu[rcv->pdu_offset], rcv_len);
+	if (unlikely(ret))
+		return ret;
+
+	rcv->pdu_remaining -= rcv_len;
+	rcv->pdu_offset += rcv_len;
+	*offset += rcv_len;
+	*len -= rcv_len;
+	if (queue->rcv.pdu_remaining)
+		return 0;
+
+	hdr = (void *)rcv->pdu;
+	if (queue->hdr_digest) {
+		ret = nvme_tcp_verify_hdgst(queue, rcv->pdu, hdr->hlen);
+		if (unlikely(ret))
+			return ret;
+	}
+
+	if (queue->data_digest) {
+		ret = nvme_tcp_check_ddgst(queue, rcv->pdu);
+		if (unlikely(ret))
+			return ret;
+	}
+
+	switch (hdr->type) {
+	case nvme_tcp_c2h_data:
+		ret = nvme_tcp_handle_c2h_data(queue, (void *)rcv->pdu);
+		break;
+	case nvme_tcp_rsp:
+		nvme_tcp_init_recv_ctx(queue);
+		ret = nvme_tcp_handle_comp(queue, (void *)rcv->pdu);
+		break;
+	case nvme_tcp_r2t:
+		nvme_tcp_init_recv_ctx(queue);
+		ret = nvme_tcp_handle_r2t(queue, (void *)rcv->pdu);
+		break;
+	default:
+		dev_err(queue->ctrl->ctrl.device, "unsupported pdu type (%d)\n",
+			hdr->type);
+		return -EINVAL;
+	}
+
+	return ret;
+}
+
+static void nvme_tcp_init_recv_iter(struct nvme_tcp_request *req)
+{
+	struct bio *bio = req->rcv.curr_bio;
+	struct bio_vec *vec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+	unsigned int nsegs = bio_segments(bio);
+
+	iov_iter_bvec(&req->rcv.iter, READ, vec, nsegs,
+		bio->bi_iter.bi_size);
+	req->rcv.iter.iov_offset = bio->bi_iter.bi_bvec_done;
+}
+
+static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
+			      unsigned int *offset, size_t *len)
+{
+	struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+	struct nvme_tcp_data_pdu *pdu = (void *)rcv->pdu;
+	struct nvme_tcp_request *req;
+	struct request *rq;
+
+	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
+	if (!rq) {
+		dev_err(queue->ctrl->ctrl.device,
+			"queue %d tag %#x not found\n",
+			nvme_tcp_queue_id(queue), pdu->command_id);
+		return -ENOENT;
+	}
+	req = blk_mq_rq_to_pdu(rq);
+
+	while (true) {
+		int recv_len, ret;
+
+		recv_len = min_t(size_t, *len, rcv->data_remaining);
+		if (!recv_len)
+			break;
+
+		/*
+		 * FIXME: This assumes that data comes in-order,
+		 *  need to handle the out-of-order case.
+		 */
+		if (!iov_iter_count(&req->rcv.iter)) {
+			req->rcv.curr_bio = req->rcv.curr_bio->bi_next;
+
+			/*
+			 * If we don't have any bios it means the controller
+			 * sent more data than we requested, hence error.
+			 */
+			if (!req->rcv.curr_bio) {
+				dev_err(queue->ctrl->ctrl.device,
+					"queue %d no space in request %#x",
+					nvme_tcp_queue_id(queue), rq->tag);
+				nvme_tcp_init_recv_ctx(queue);
+				return -EIO;
+			}
+			nvme_tcp_init_recv_iter(req);
+		}
+
+		/* we can read only from what is left in this bio */
+		recv_len = min_t(size_t, recv_len,
+				iov_iter_count(&req->rcv.iter));
+
+		if (queue->data_digest)
+			ret = skb_copy_and_hash_datagram_iter(skb, *offset,
+				&req->rcv.iter, recv_len, queue->rcv_hash);
+		else
+			ret = skb_copy_datagram_iter(skb, *offset,
+					&req->rcv.iter, recv_len);
+		if (ret) {
+			dev_err(queue->ctrl->ctrl.device,
+				"queue %d failed to copy request %#x data",
+				nvme_tcp_queue_id(queue), rq->tag);
+			return ret;
+		}
+
+		*len -= recv_len;
+		*offset += recv_len;
+		rcv->data_remaining -= recv_len;
+	}
+
+	if (!rcv->data_remaining) {
+		if (queue->data_digest) {
+			nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
+			rcv->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
+		} else {
+			nvme_tcp_init_recv_ctx(queue);
+		}
+	}
+
+	return 0;
+}
+
+static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
+		struct sk_buff *skb, unsigned int *offset, size_t *len)
+{
+	struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+	char *ddgst = (char *)&queue->recv_ddgst;
+	size_t recv_len = min_t(size_t, *len, rcv->ddgst_remaining);
+	off_t off = NVME_TCP_DIGEST_LENGTH - rcv->ddgst_remaining;
+	int ret;
+
+	ret = skb_copy_bits(skb, *offset, &ddgst[off], recv_len);
+	if (unlikely(ret))
+		return ret;
+
+	rcv->ddgst_remaining -= recv_len;
+	*offset += recv_len;
+	*len -= recv_len;
+	if (rcv->ddgst_remaining)
+		return 0;
+
+	if (queue->recv_ddgst != queue->exp_ddgst) {
+		dev_err(queue->ctrl->ctrl.device,
+			"data digest error: recv %#x expected %#x\n",
+			le32_to_cpu(queue->recv_ddgst),
+			le32_to_cpu(queue->exp_ddgst));
+		return -EIO;
+	}
+
+	nvme_tcp_init_recv_ctx(queue);
+	return 0;
+}
+
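+/*
+ * tcp_read_sock() callback: consume the skb according to the current
+ * receive state (PDU header, data or data digest) until it is fully
+ * processed or an error occurs.
+ */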
+static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
+			     unsigned int offset, size_t len)
+{
+	struct nvme_tcp_queue *queue = desc->arg.data;
+	size_t consumed = len;
+	int result;
+
+	while (len) {
+		switch (nvme_tcp_recv_state(queue)) {
+		case NVME_TCP_RECV_PDU:
+			result = nvme_tcp_recv_pdu(queue, skb, &offset, &len);
+			break;
+		case NVME_TCP_RECV_DATA:
+			result = nvme_tcp_recv_data(queue, skb, &offset, &len);
+			break;
+		case NVME_TCP_RECV_DDGST:
+			result = nvme_tcp_recv_ddgst(queue, skb, &offset, &len);
+			break;
+		default:
+			result = -EFAULT;
+		}
+		if (result) {
+			dev_err(queue->ctrl->ctrl.device,
+				"receive failed:  %d\n", result);
+			queue->rd_enabled = false;
+			nvme_tcp_error_recovery(&queue->ctrl->ctrl);
+			return result;
+		}
+	}
+
+	return consumed;
+}
+
+static void nvme_tcp_data_ready(struct sock *sk)
+{
+	struct nvme_tcp_queue *queue;
+
+	read_lock(&sk->sk_callback_lock);
+	queue = sk->sk_user_data;
+	if (unlikely(!queue || !queue->rd_enabled))
+		goto done;
+
+	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+done:
+	read_unlock(&sk->sk_callback_lock);
+}
+
+static void nvme_tcp_write_space(struct sock *sk)
+{
+	struct nvme_tcp_queue *queue;
+
+	read_lock_bh(&sk->sk_callback_lock);
+	queue = sk->sk_user_data;
+
+	if (!queue)
+		goto done;
+
+	if (sk_stream_is_writeable(sk)) {
+		clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+	}
+done:
+	read_unlock_bh(&sk->sk_callback_lock);
+}
+
+static void nvme_tcp_state_change(struct sock *sk)
+{
+	struct nvme_tcp_queue *queue;
+
+	read_lock(&sk->sk_callback_lock);
+	queue = sk->sk_user_data;
+	if (!queue)
+		goto done;
+
+	switch (sk->sk_state) {
+	case TCP_CLOSE:
+	case TCP_CLOSE_WAIT:
+	case TCP_LAST_ACK:
+	case TCP_FIN_WAIT1:
+	case TCP_FIN_WAIT2:
+		/* fallthrough */
+		nvme_tcp_error_recovery(&queue->ctrl->ctrl);
+		break;
+	default:
+		dev_info(queue->ctrl->ctrl.device,
+			"queue %d socket state %d\n",
+			nvme_tcp_queue_id(queue), sk->sk_state);
+	}
+
+	queue->sc(sk);
+done:
+	read_unlock(&sk->sk_callback_lock);
+}
+
+static inline void nvme_tcp_done_send_req(struct nvme_tcp_queue *queue)
+{
+	queue->request = NULL;
+}
+
+static void nvme_tcp_fail_request(struct nvme_tcp_request *req)
+{
+	union nvme_result res = {};
+
+	nvme_end_request(blk_mq_rq_from_pdu(req),
+		NVME_SC_DATA_XFER_ERROR, res);
+}
+
+static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
+{
+	struct nvme_tcp_queue *queue = req->queue;
+
+	while (true) {
+		struct page *page = nvme_tcp_req_cur_page(req);
+		size_t offset = nvme_tcp_req_cur_offset(req);
+		size_t len = nvme_tcp_req_cur_length(req);
+		bool last = nvme_tcp_pdu_last_send(req, len);
+		int ret, flags = MSG_DONTWAIT;
+
+		if (last && !queue->data_digest)
+			flags |= MSG_EOR;
+		else
+			flags |= MSG_MORE;
+
+		ret = kernel_sendpage(queue->sock, page, offset, len, flags);
+		if (ret <= 0)
+			return ret;
+
+		nvme_tcp_advance_req(req, ret);
+		if (queue->data_digest)
+			nvme_tcp_ddgst_update(queue->snd_hash, page, offset, ret);
+
+		/* fully successful last write */
+		if (last && ret == len) {
+			if (queue->data_digest) {
+				nvme_tcp_ddgst_final(queue->snd_hash,
+					&req->ddgst);
+				req->snd.state = NVME_TCP_SEND_DDGST;
+				req->snd.offset = 0;
+			} else {
+				nvme_tcp_done_send_req(queue);
+			}
+			return 1;
+		}
+	}
+	return -EAGAIN;
+}
+
+static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
+{
+	struct nvme_tcp_queue *queue = req->queue;
+	struct nvme_tcp_send_ctx *snd = &req->snd;
+	struct nvme_tcp_cmd_pdu *pdu = req->pdu;
+	bool inline_data = nvme_tcp_has_inline_data(req);
+	int flags = MSG_DONTWAIT | (inline_data ? MSG_MORE : MSG_EOR);
+	u8 hdgst = nvme_tcp_hdgst_len(queue);
+	int len = sizeof(*pdu) + hdgst - snd->offset;
+	int ret;
+
+	if (queue->hdr_digest && !snd->offset)
+		nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
+
+	ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
+			offset_in_page(pdu) + snd->offset, len,  flags);
+	if (unlikely(ret <= 0))
+		return ret;
+
+	len -= ret;
+	if (!len) {
+		if (inline_data) {
+			req->snd.state = NVME_TCP_SEND_DATA;
+			if (queue->data_digest)
+				crypto_ahash_init(queue->snd_hash);
+			nvme_tcp_init_send_iter(req);
+		} else {
+			nvme_tcp_done_send_req(queue);
+		}
+		return 1;
+	}
+	snd->offset += ret;
+
+	return -EAGAIN;
+}
+
+static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req)
+{
+	struct nvme_tcp_queue *queue = req->queue;
+	struct nvme_tcp_send_ctx *snd = &req->snd;
+	struct nvme_tcp_data_pdu *pdu = req->pdu;
+	u8 hdgst = nvme_tcp_hdgst_len(queue);
+	int len = sizeof(*pdu) - snd->offset + hdgst;
+	int ret;
+
+	if (queue->hdr_digest && !snd->offset)
+		nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
+
+	ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
+			offset_in_page(pdu) + snd->offset, len,
+			MSG_DONTWAIT | MSG_MORE);
+	if (unlikely(ret <= 0))
+		return ret;
+
+	len -= ret;
+	if (!len) {
+		req->snd.state = NVME_TCP_SEND_DATA;
+		if (queue->data_digest)
+			crypto_ahash_init(queue->snd_hash);
+		if (!req->snd.data_sent)
+			nvme_tcp_init_send_iter(req);
+		return 1;
+	}
+	snd->offset += ret;
+
+	return -EAGAIN;
+}
+
+static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req)
+{
+	struct nvme_tcp_queue *queue = req->queue;
+	int ret;
+	struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_EOR };
+	struct kvec iov = {
+		.iov_base = &req->ddgst + req->snd.offset,
+		.iov_len = NVME_TCP_DIGEST_LENGTH - req->snd.offset
+	};
+
+	ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
+	if (unlikely(ret <= 0))
+		return ret;
+
+	if (req->snd.offset + ret == NVME_TCP_DIGEST_LENGTH) {
+		nvme_tcp_done_send_req(queue);
+		return 1;
+	}
+
+	req->snd.offset += ret;
+	return -EAGAIN;
+}
+
+static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
+{
+	struct nvme_tcp_request *req;
+	int ret = 1;
+
+	if (!queue->request) {
+		queue->request = nvme_tcp_fetch_request(queue);
+		if (!queue->request)
+			return 0;
+	}
+	req = queue->request;
+
+	if (req->snd.state == NVME_TCP_SEND_CMD_PDU) {
+		ret = nvme_tcp_try_send_cmd_pdu(req);
+		if (ret <= 0)
+			goto done;
+		if (!nvme_tcp_has_inline_data(req))
+			return ret;
+	}
+
+	if (req->snd.state == NVME_TCP_SEND_H2C_PDU) {
+		ret = nvme_tcp_try_send_data_pdu(req);
+		if (ret <= 0)
+			goto done;
+	}
+
+	if (req->snd.state == NVME_TCP_SEND_DATA) {
+		ret = nvme_tcp_try_send_data(req);
+		if (ret <= 0)
+			goto done;
+	}
+
+	if (req->snd.state == NVME_TCP_SEND_DDGST)
+		ret = nvme_tcp_try_send_ddgst(req);
+done:
+	if (ret == -EAGAIN)
+		ret = 0;
+	return ret;
+}
+
+static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
+{
+	struct sock *sk = queue->sock->sk;
+	read_descriptor_t rd_desc;
+	int consumed;
+
+	rd_desc.arg.data = queue;
+	rd_desc.count = 1;
+	lock_sock(sk);
+	consumed = tcp_read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
+	release_sock(sk);
+	return consumed;
+}
+
+static void nvme_tcp_io_work(struct work_struct *w)
+{
+	struct nvme_tcp_queue *queue =
+		container_of(w, struct nvme_tcp_queue, io_work);
+	unsigned long start = jiffies + msecs_to_jiffies(1);
+
+	do {
+		bool pending = false;
+		int result;
+
+		result = nvme_tcp_try_send(queue);
+		if (result > 0) {
+			pending = true;
+		} else if (unlikely(result < 0)) {
+			dev_err(queue->ctrl->ctrl.device,
+				"failed to send request %d\n", result);
+			if (result != -EPIPE)
+				nvme_tcp_fail_request(queue->request);
+			nvme_tcp_done_send_req(queue);
+			return;
+		}
+
+		result = nvme_tcp_try_recv(queue);
+		if (result > 0)
+			pending = true;
+
+		if (!pending)
+			return;
+
+	} while (time_after(jiffies, start)); /* quota is exhausted */
+
+	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+}
+
+static void nvme_tcp_free_crypto(struct nvme_tcp_queue *queue)
+{
+	struct crypto_ahash *tfm = crypto_ahash_reqtfm(queue->rcv_hash);
+
+	ahash_request_free(queue->rcv_hash);
+	ahash_request_free(queue->snd_hash);
+	crypto_free_ahash(tfm);
+}
+
+static int nvme_tcp_alloc_crypto(struct nvme_tcp_queue *queue)
+{
+	struct crypto_ahash *tfm;
+
+	tfm = crypto_alloc_ahash("crc32c", 0, CRYPTO_ALG_ASYNC);
+	if (IS_ERR(tfm))
+		return PTR_ERR(tfm);
+
+	queue->snd_hash = ahash_request_alloc(tfm, GFP_KERNEL);
+	if (!queue->snd_hash)
+		goto free_tfm;
+	ahash_request_set_callback(queue->snd_hash, 0, NULL, NULL);
+
+	queue->rcv_hash = ahash_request_alloc(tfm, GFP_KERNEL);
+	if (!queue->rcv_hash)
+		goto free_snd_hash;
+	ahash_request_set_callback(queue->rcv_hash, 0, NULL, NULL);
+
+	return 0;
+free_snd_hash:
+	ahash_request_free(queue->snd_hash);
+free_tfm:
+	crypto_free_ahash(tfm);
+	return -ENOMEM;
+}
+
+static void nvme_tcp_free_async_req(struct nvme_tcp_ctrl *ctrl)
+{
+	struct nvme_tcp_request *async = &ctrl->async_req;
+
+	page_frag_free(async->pdu);
+}
+
+static int nvme_tcp_alloc_async_req(struct nvme_tcp_ctrl *ctrl)
+{
+	struct nvme_tcp_queue *queue = &ctrl->queues[0];
+	struct nvme_tcp_request *async = &ctrl->async_req;
+	u8 hdgst = nvme_tcp_hdgst_len(queue);
+
+	async->pdu = page_frag_alloc(&queue->pf_cache,
+		sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
+		GFP_KERNEL | __GFP_ZERO);
+	if (!async->pdu)
+		return -ENOMEM;
+
+	async->queue = &ctrl->queues[0];
+	return 0;
+}
+
+static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
+{
+	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+
+	if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
+		return;
+
+	if (queue->hdr_digest || queue->data_digest)
+		nvme_tcp_free_crypto(queue);
+
+	sock_release(queue->sock);
+	kfree(queue->rcv.pdu);
+}
+
+static int nvme_tcp_init_connection(struct nvme_tcp_queue *queue)
+{
+	struct nvme_tcp_icreq_pdu *icreq;
+	struct nvme_tcp_icresp_pdu *icresp;
+	struct msghdr msg = {};
+	struct kvec iov;
+	bool ctrl_hdgst, ctrl_ddgst;
+	int ret;
+
+	icreq = kzalloc(sizeof(*icreq), GFP_KERNEL);
+	if (!icreq)
+		return -ENOMEM;
+
+	icresp = kzalloc(sizeof(*icresp), GFP_KERNEL);
+	if (!icresp) {
+		ret = -ENOMEM;
+		goto free_icreq;
+	}
+
+	icreq->hdr.type = nvme_tcp_icreq;
+	icreq->hdr.hlen = sizeof(*icreq);
+	icreq->hdr.pdo = 0;
+	icreq->hdr.plen = cpu_to_le32(icreq->hdr.hlen);
+	icreq->pfv = cpu_to_le16(NVME_TCP_PFV_1_0);
+	icreq->maxr2t = cpu_to_le16(1); /* single inflight r2t supported */
+	icreq->hpda = 0; /* no alignment constraint */
+	if (queue->hdr_digest)
+		icreq->digest |= NVME_TCP_HDR_DIGEST_ENABLE;
+	if (queue->data_digest)
+		icreq->digest |= NVME_TCP_DATA_DIGEST_ENABLE;
+
+	iov.iov_base = icreq;
+	iov.iov_len = sizeof(*icreq);
+	ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
+	if (ret < 0)
+		goto free_icresp;
+
+	memset(&msg, 0, sizeof(msg));
+	iov.iov_base = icresp;
+	iov.iov_len = sizeof(*icresp);
+	ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
+			iov.iov_len, msg.msg_flags);
+	if (ret < 0)
+		goto free_icresp;
+
+	ret = -EINVAL;
+	if (icresp->hdr.type != nvme_tcp_icresp) {
+		pr_err("queue %d: bad type returned %d\n",
+			nvme_tcp_queue_id(queue), icresp->hdr.type);
+		goto free_icresp;
+	}
+
+	if (le32_to_cpu(icresp->hdr.plen) != sizeof(*icresp)) {
+		pr_err("queue %d: bad pdu length returned %d\n",
+			nvme_tcp_queue_id(queue), icresp->hdr.plen);
+		goto free_icresp;
+	}
+
+	if (icresp->pfv != NVME_TCP_PFV_1_0) {
+		pr_err("queue %d: bad pfv returned %d\n",
+			nvme_tcp_queue_id(queue), icresp->pfv);
+		goto free_icresp;
+	}
+
+	ctrl_ddgst = !!(icresp->digest & NVME_TCP_DATA_DIGEST_ENABLE);
+	if ((queue->data_digest && !ctrl_ddgst) ||
+	    (!queue->data_digest && ctrl_ddgst)) {
+		pr_err("queue %d: data digest mismatch host: %s ctrl: %s\n",
+			nvme_tcp_queue_id(queue),
+			queue->data_digest ? "enabled" : "disabled",
+			ctrl_ddgst ? "enabled" : "disabled");
+		goto free_icresp;
+	}
+
+	ctrl_hdgst = !!(icresp->digest & NVME_TCP_HDR_DIGEST_ENABLE);
+	if ((queue->hdr_digest && !ctrl_hdgst) ||
+	    (!queue->hdr_digest && ctrl_hdgst)) {
+		pr_err("queue %d: header digest mismatch host: %s ctrl: %s\n",
+			nvme_tcp_queue_id(queue),
+			queue->hdr_digest ? "enabled" : "disabled",
+			ctrl_hdgst ? "enabled" : "disabled");
+		goto free_icresp;
+	}
+
+	if (icresp->cpda != 0) {
+		pr_err("queue %d: unsupported cpda returned %d\n",
+			nvme_tcp_queue_id(queue), icresp->cpda);
+		goto free_icresp;
+	}
+
+	ret = 0;
+free_icresp:
+	kfree(icresp);
+free_icreq:
+	kfree(icreq);
+	return ret;
+}
+
+static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
+		int qid, size_t queue_size)
+{
+	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+	struct linger sol = { .l_onoff = 1, .l_linger = 0 };
+	int ret, opt, rcv_pdu_size;
+
+	queue->ctrl = ctrl;
+	INIT_LIST_HEAD(&queue->send_list);
+	spin_lock_init(&queue->lock);
+	INIT_WORK(&queue->io_work, nvme_tcp_io_work);
+	queue->queue_size = queue_size;
+
+	if (qid > 0)
+		queue->cmnd_capsule_len = ctrl->ctrl.ioccsz * 16;
+	else
+		queue->cmnd_capsule_len = sizeof(struct nvme_command) +
+						NVME_TCP_ADMIN_CCSZ;
+
+	ret = sock_create(ctrl->addr.ss_family, SOCK_STREAM,
+			IPPROTO_TCP, &queue->sock);
+	if (ret) {
+		dev_err(ctrl->ctrl.device,
+			"failed to create socket: %d\n", ret);
+		return ret;
+	}
+
+	/* Single syn retry */
+	opt = 1;
+	ret = kernel_setsockopt(queue->sock, IPPROTO_TCP, TCP_SYNCNT,
+			(char *)&opt, sizeof(opt));
+	if (ret) {
+		dev_err(ctrl->ctrl.device,
+			"failed to set TCP_SYNCNT sock opt %d\n", ret);
+		goto err_sock;
+	}
+
+	/* Set TCP no delay */
+	opt = 1;
+	ret = kernel_setsockopt(queue->sock, IPPROTO_TCP,
+			TCP_NODELAY, (char *)&opt, sizeof(opt));
+	if (ret) {
+		dev_err(ctrl->ctrl.device,
+			"failed to set TCP_NODELAY sock opt %d\n", ret);
+		goto err_sock;
+	}
+
+	/*
+	 * Cleanup whatever is sitting in the TCP transmit queue on socket
+	 * close. This is done to prevent stale data from being sent should
+	 * the network connection be restored before TCP times out.
+	 */
+	ret = kernel_setsockopt(queue->sock, SOL_SOCKET, SO_LINGER,
+			(char *)&sol, sizeof(sol));
+	if (ret) {
+		dev_err(ctrl->ctrl.device,
+			"failed to set SO_LINGER sock opt %d\n", ret);
+		goto err_sock;
+	}
+
+	queue->sock->sk->sk_allocation = GFP_ATOMIC;
+	queue->io_cpu = (qid == 0) ? 0 : qid - 1;
+	queue->request = NULL;
+	queue->rcv.data_remaining = 0;
+	queue->rcv.ddgst_remaining = 0;
+	queue->rcv.pdu_remaining = 0;
+	queue->rcv.pdu_offset = 0;
+	sk_set_memalloc(queue->sock->sk);
+
+	if (ctrl->ctrl.opts->mask & NVMF_OPT_HOST_TRADDR) {
+		ret = kernel_bind(queue->sock, (struct sockaddr *)&ctrl->src_addr,
+			sizeof(ctrl->src_addr));
+		if (ret) {
+			dev_err(ctrl->ctrl.device,
+				"failed to bind queue %d socket %d\n",
+				qid, ret);
+			goto err_sock;
+		}
+	}
+
+	queue->hdr_digest = nctrl->opts->hdr_digest;
+	queue->data_digest = nctrl->opts->data_digest;
+	if (queue->hdr_digest || queue->data_digest) {
+		ret = nvme_tcp_alloc_crypto(queue);
+		if (ret) {
+			dev_err(ctrl->ctrl.device,
+				"failed to allocate queue %d crypto\n", qid);
+			goto err_sock;
+		}
+	}
+
+	rcv_pdu_size = sizeof(struct nvme_tcp_rsp_pdu) +
+			nvme_tcp_hdgst_len(queue);
+	queue->rcv.pdu = kmalloc(rcv_pdu_size, GFP_KERNEL);
+	if (!queue->rcv.pdu) {
+		ret = -ENOMEM;
+		goto err_crypto;
+	}
+
+	dev_dbg(ctrl->ctrl.device, "connecting queue %d\n",
+			nvme_tcp_queue_id(queue));
+
+	ret = kernel_connect(queue->sock, (struct sockaddr *)&ctrl->addr,
+		sizeof(ctrl->addr), 0);
+	if (ret) {
+		dev_err(ctrl->ctrl.device,
+			"failed to connect socket: %d\n", ret);
+		goto err_rcv_pdu;
+	}
+
+	ret = nvme_tcp_init_connection(queue);
+	if (ret)
+		goto err_init_connect;
+
+	queue->rd_enabled = true;
+	set_bit(NVME_TCP_Q_ALLOCATED, &queue->flags);
+	nvme_tcp_init_recv_ctx(queue);
+
+	write_lock_bh(&queue->sock->sk->sk_callback_lock);
+	queue->sock->sk->sk_user_data = queue;
+	queue->sc = queue->sock->sk->sk_state_change;
+	queue->dr = queue->sock->sk->sk_data_ready;
+	queue->ws = queue->sock->sk->sk_write_space;
+	queue->sock->sk->sk_data_ready = nvme_tcp_data_ready;
+	queue->sock->sk->sk_state_change = nvme_tcp_state_change;
+	queue->sock->sk->sk_write_space = nvme_tcp_write_space;
+	write_unlock_bh(&queue->sock->sk->sk_callback_lock);
+
+	return 0;
+
+err_init_connect:
+	kernel_sock_shutdown(queue->sock, SHUT_RDWR);
+err_rcv_pdu:
+	kfree(queue->rcv.pdu);
+err_crypto:
+	if (queue->hdr_digest || queue->data_digest)
+		nvme_tcp_free_crypto(queue);
+err_sock:
+	sock_release(queue->sock);
+	queue->sock = NULL;
+	return ret;
+}
+
+static void nvme_tcp_restore_sock_calls(struct nvme_tcp_queue *queue)
+{
+	struct socket *sock = queue->sock;
+
+	write_lock_bh(&sock->sk->sk_callback_lock);
+	sock->sk->sk_user_data  = NULL;
+	sock->sk->sk_data_ready = queue->dr;
+	sock->sk->sk_state_change = queue->sc;
+	sock->sk->sk_write_space  = queue->ws;
+	write_unlock_bh(&sock->sk->sk_callback_lock);
+}
+
+static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
+{
+	kernel_sock_shutdown(queue->sock, SHUT_RDWR);
+	nvme_tcp_restore_sock_calls(queue);
+	cancel_work_sync(&queue->io_work);
+}
+
+static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
+{
+	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+
+	if (!test_and_clear_bit(NVME_TCP_Q_LIVE, &queue->flags))
+		return;
+
+	__nvme_tcp_stop_queue(queue);
+}
+
+static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
+{
+	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+	int ret;
+
+	if (idx)
+		ret = nvmf_connect_io_queue(nctrl, idx);
+	else
+		ret = nvmf_connect_admin_queue(nctrl);
+
+	if (!ret) {
+		set_bit(NVME_TCP_Q_LIVE, &ctrl->queues[idx].flags);
+	} else {
+		__nvme_tcp_stop_queue(&ctrl->queues[idx]);
+		dev_err(nctrl->device,
+			"failed to connect queue: %d ret=%d\n", idx, ret);
+	}
+	return ret;
+}
+
+static void nvme_tcp_free_tagset(struct nvme_ctrl *nctrl,
+		struct blk_mq_tag_set *set)
+{
+	blk_mq_free_tag_set(set);
+}
+
+static struct blk_mq_tag_set *nvme_tcp_alloc_tagset(struct nvme_ctrl *nctrl,
+		bool admin)
+{
+	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+	struct blk_mq_tag_set *set;
+	int ret;
+
+	if (admin) {
+		set = &ctrl->admin_tag_set;
+		memset(set, 0, sizeof(*set));
+		set->ops = &nvme_tcp_admin_mq_ops;
+		set->queue_depth = NVME_AQ_MQ_TAG_DEPTH;
+		set->reserved_tags = 2; /* connect + keep-alive */
+		set->numa_node = NUMA_NO_NODE;
+		set->cmd_size = sizeof(struct nvme_tcp_request);
+		set->driver_data = ctrl;
+		set->nr_hw_queues = 1;
+		set->timeout = ADMIN_TIMEOUT;
+	} else {
+		set = &ctrl->tag_set;
+		memset(set, 0, sizeof(*set));
+		set->ops = &nvme_tcp_mq_ops;
+		set->queue_depth = nctrl->sqsize + 1;
+		set->reserved_tags = 1; /* fabric connect */
+		set->numa_node = NUMA_NO_NODE;
+		set->flags = BLK_MQ_F_SHOULD_MERGE;
+		set->cmd_size = sizeof(struct nvme_tcp_request);
+		set->driver_data = ctrl;
+		set->nr_hw_queues = nctrl->queue_count - 1;
+		set->timeout = NVME_IO_TIMEOUT;
+	}
+
+	ret = blk_mq_alloc_tag_set(set);
+	if (ret)
+		return ERR_PTR(ret);
+
+	return set;
+}
+
+static void nvme_tcp_free_admin_queue(struct nvme_ctrl *ctrl)
+{
+	if (to_tcp_ctrl(ctrl)->async_req.pdu) {
+		nvme_tcp_free_async_req(to_tcp_ctrl(ctrl));
+		to_tcp_ctrl(ctrl)->async_req.pdu = NULL;
+	}
+
+	nvme_tcp_free_queue(ctrl, 0);
+}
+
+static void nvme_tcp_free_io_queues(struct nvme_ctrl *ctrl)
+{
+	int i;
+
+	for (i = 1; i < ctrl->queue_count; i++)
+		nvme_tcp_free_queue(ctrl, i);
+}
+
+static void nvme_tcp_stop_admin_queue(struct nvme_ctrl *ctrl)
+{
+	nvme_tcp_stop_queue(ctrl, 0);
+}
+
+static void nvme_tcp_stop_io_queues(struct nvme_ctrl *ctrl)
+{
+	int i;
+
+	for (i = 1; i < ctrl->queue_count; i++)
+		nvme_tcp_stop_queue(ctrl, i);
+}
+
+static int nvme_tcp_start_admin_queue(struct nvme_ctrl *ctrl)
+{
+	return nvme_tcp_start_queue(ctrl, 0);
+}
+
+static int nvme_tcp_start_io_queues(struct nvme_ctrl *ctrl)
+{
+	int i, ret = 0;
+
+	for (i = 1; i < ctrl->queue_count; i++) {
+		ret = nvme_tcp_start_queue(ctrl, i);
+		if (ret)
+			goto out_stop_queues;
+	}
+
+	return 0;
+
+out_stop_queues:
+	for (i--; i >= 1; i--)
+		nvme_tcp_stop_queue(ctrl, i);
+	return ret;
+}
+
+static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
+{
+	int ret;
+
+	ret = nvme_tcp_alloc_queue(ctrl, 0, NVME_AQ_DEPTH);
+	if (ret)
+		return ret;
+
+	ret = nvme_tcp_alloc_async_req(to_tcp_ctrl(ctrl));
+	if (ret)
+		goto out_free_queue;
+
+	return 0;
+
+out_free_queue:
+	nvme_tcp_free_queue(ctrl, 0);
+	return ret;
+}
+
+static int nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
+{
+	int i, ret;
+
+	for (i = 1; i < ctrl->queue_count; i++) {
+		ret = nvme_tcp_alloc_queue(ctrl, i,
+				ctrl->sqsize + 1);
+		if (ret)
+			goto out_free_queues;
+	}
+
+	return 0;
+
+out_free_queues:
+	for (i--; i >= 1; i--)
+		nvme_tcp_free_queue(ctrl, i);
+
+	return ret;
+}
+
+static unsigned int nvme_tcp_nr_io_queues(struct nvme_ctrl *ctrl)
+{
+	return min(ctrl->queue_count - 1, num_online_cpus());
+}
+
+static int nvme_alloc_io_queues(struct nvme_ctrl *ctrl)
+{
+	unsigned int nr_io_queues;
+	int ret;
+
+	nr_io_queues = nvme_tcp_nr_io_queues(ctrl);
+	ret = nvme_set_queue_count(ctrl, &nr_io_queues);
+	if (ret)
+		return ret;
+
+	ctrl->queue_count = nr_io_queues + 1;
+	if (ctrl->queue_count < 2)
+		return 0;
+
+	dev_info(ctrl->device,
+		"creating %d I/O queues.\n", nr_io_queues);
+
+	return nvme_tcp_alloc_io_queues(ctrl);
+}
+
+void nvme_tcp_destroy_io_queues(struct nvme_ctrl *ctrl, bool remove)
+{
+	nvme_tcp_stop_io_queues(ctrl);
+	if (remove) {
+		if (ctrl->ops->flags & NVME_F_FABRICS)
+			blk_cleanup_queue(ctrl->connect_q);
+		nvme_tcp_free_tagset(ctrl, ctrl->tagset);
+	}
+	nvme_tcp_free_io_queues(ctrl);
+}
+
+int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
+{
+	int ret;
+
+	ret = nvme_alloc_io_queues(ctrl);
+	if (ret)
+		return ret;
+
+	if (new) {
+		ctrl->tagset = nvme_tcp_alloc_tagset(ctrl, false);
+		if (IS_ERR(ctrl->tagset)) {
+			ret = PTR_ERR(ctrl->tagset);
+			goto out_free_io_queues;
+		}
+
+		if (ctrl->ops->flags & NVME_F_FABRICS) {
+			ctrl->connect_q = blk_mq_init_queue(ctrl->tagset);
+			if (IS_ERR(ctrl->connect_q)) {
+				ret = PTR_ERR(ctrl->connect_q);
+				goto out_free_tag_set;
+			}
+		}
+	} else {
+		blk_mq_update_nr_hw_queues(ctrl->tagset,
+			ctrl->queue_count - 1);
+	}
+
+	ret = nvme_tcp_start_io_queues(ctrl);
+	if (ret)
+		goto out_cleanup_connect_q;
+
+	return 0;
+
+out_cleanup_connect_q:
+	if (new && (ctrl->ops->flags & NVME_F_FABRICS))
+		blk_cleanup_queue(ctrl->connect_q);
+out_free_tag_set:
+	if (new)
+		nvme_tcp_free_tagset(ctrl, ctrl->tagset);
+out_free_io_queues:
+	nvme_tcp_free_io_queues(ctrl);
+	return ret;
+}
+
+void nvme_tcp_destroy_admin_queue(struct nvme_ctrl *ctrl, bool remove)
+{
+	nvme_tcp_stop_admin_queue(ctrl);
+	if (remove) {
+		free_opal_dev(ctrl->opal_dev);
+		blk_cleanup_queue(ctrl->admin_q);
+		nvme_tcp_free_tagset(ctrl, ctrl->admin_tagset);
+	}
+	nvme_tcp_free_admin_queue(ctrl);
+}
+
+int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new)
+{
+	int error;
+
+	error = nvme_tcp_alloc_admin_queue(ctrl);
+	if (error)
+		return error;
+
+	if (new) {
+		ctrl->admin_tagset = nvme_tcp_alloc_tagset(ctrl, true);
+		if (IS_ERR(ctrl->admin_tagset)) {
+			error = PTR_ERR(ctrl->admin_tagset);
+			goto out_free_queue;
+		}
+
+		ctrl->admin_q = blk_mq_init_queue(ctrl->admin_tagset);
+		if (IS_ERR(ctrl->admin_q)) {
+			error = PTR_ERR(ctrl->admin_q);
+			goto out_free_tagset;
+		}
+	}
+
+	error = nvme_tcp_start_admin_queue(ctrl);
+	if (error)
+		goto out_cleanup_queue;
+
+	error = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &ctrl->cap);
+	if (error) {
+		dev_err(ctrl->device,
+			"prop_get NVME_REG_CAP failed\n");
+		goto out_stop_queue;
+	}
+
+	ctrl->sqsize = min_t(int, NVME_CAP_MQES(ctrl->cap), ctrl->sqsize);
+
+	error = nvme_enable_ctrl(ctrl, ctrl->cap);
+	if (error)
+		goto out_stop_queue;
+
+	error = nvme_init_identify(ctrl);
+	if (error)
+		goto out_stop_queue;
+
+	return 0;
+
+out_stop_queue:
+	nvme_tcp_stop_admin_queue(ctrl);
+out_cleanup_queue:
+	if (new)
+		blk_cleanup_queue(ctrl->admin_q);
+out_free_tagset:
+	if (new)
+		nvme_tcp_free_tagset(ctrl, ctrl->admin_tagset);
+out_free_queue:
+	nvme_tcp_free_admin_queue(ctrl);
+	return error;
+}
+
+static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
+		bool remove)
+{
+	blk_mq_quiesce_queue(ctrl->admin_q);
+	nvme_tcp_stop_admin_queue(ctrl);
+	blk_mq_tagset_busy_iter(ctrl->admin_tagset, nvme_cancel_request, ctrl);
+	blk_mq_unquiesce_queue(ctrl->admin_q);
+	nvme_tcp_destroy_admin_queue(ctrl, remove);
+}
+
+static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
+		bool remove)
+{
+	if (ctrl->queue_count > 1) {
+		nvme_stop_queues(ctrl);
+		nvme_tcp_stop_io_queues(ctrl);
+		blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl);
+		if (remove)
+			nvme_start_queues(ctrl);
+		nvme_tcp_destroy_io_queues(ctrl, remove);
+	}
+}
+
+static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl)
+{
+	/* If we are resetting/deleting then do nothing */
+	if (ctrl->state != NVME_CTRL_CONNECTING) {
+		WARN_ON_ONCE(ctrl->state == NVME_CTRL_NEW ||
+			ctrl->state == NVME_CTRL_LIVE);
+		return;
+	}
+
+	if (nvmf_should_reconnect(ctrl)) {
+		dev_info(ctrl->device, "Reconnecting in %d seconds...\n",
+			ctrl->opts->reconnect_delay);
+		queue_delayed_work(nvme_wq, &ctrl->connect_work,
+				ctrl->opts->reconnect_delay * HZ);
+	} else {
+		dev_info(ctrl->device, "Removing controller...\n");
+		nvme_delete_ctrl(ctrl);
+	}
+}
+
+static int nvme_tcp_setup_ctrl(struct nvme_ctrl *ctrl, bool new)
+{
+	struct nvmf_ctrl_options *opts = ctrl->opts;
+	int ret = -EINVAL;
+
+	ret = nvme_tcp_configure_admin_queue(ctrl, new);
+	if (ret)
+		return ret;
+
+	if (ctrl->icdoff) {
+		dev_err(ctrl->device, "icdoff is not supported!\n");
+		goto destroy_admin;
+	}
+
+	if (opts->queue_size > ctrl->sqsize + 1)
+		dev_warn(ctrl->device,
+			"queue_size %zu > ctrl sqsize %u, clamping down\n",
+			opts->queue_size, ctrl->sqsize + 1);
+
+	if (ctrl->sqsize + 1 > ctrl->maxcmd) {
+		dev_warn(ctrl->device,
+			"sqsize %u > ctrl maxcmd %u, clamping down\n",
+			ctrl->sqsize + 1, ctrl->maxcmd);
+		ctrl->sqsize = ctrl->maxcmd - 1;
+	}
+
+	if (ctrl->queue_count > 1) {
+		ret = nvme_tcp_configure_io_queues(ctrl, new);
+		if (ret)
+			goto destroy_admin;
+	}
+
+	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE)) {
+		/* state change failure is ok if we're in DELETING state */
+		WARN_ON_ONCE(ctrl->state != NVME_CTRL_DELETING);
+		ret = -EINVAL;
+		goto destroy_io;
+	}
+
+	nvme_start_ctrl(ctrl);
+	return 0;
+
+destroy_io:
+	if (ctrl->queue_count > 1)
+		nvme_tcp_destroy_io_queues(ctrl, new);
+destroy_admin:
+	nvme_tcp_stop_admin_queue(ctrl);
+	nvme_tcp_destroy_admin_queue(ctrl, new);
+	return ret;
+}
+
+static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
+{
+	struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
+			struct nvme_ctrl, connect_work);
+
+	++ctrl->nr_reconnects;
+
+	if (nvme_tcp_setup_ctrl(ctrl, false))
+		goto requeue;
+
+	dev_info(ctrl->device, "Successfully reconnected (%d attempt)\n",
+			ctrl->nr_reconnects);
+
+	ctrl->nr_reconnects = 0;
+
+	return;
+
+requeue:
+	dev_info(ctrl->device, "Failed reconnect attempt %d\n",
+			ctrl->nr_reconnects);
+	nvme_tcp_reconnect_or_remove(ctrl);
+}
+
+static void nvme_tcp_error_recovery_work(struct work_struct *work)
+{
+	struct nvme_ctrl *ctrl = container_of(work,
+			struct nvme_ctrl, err_work);
+
+	nvme_stop_keep_alive(ctrl);
+	nvme_tcp_teardown_io_queues(ctrl, false);
+	/* unquiesce to fail fast pending requests */
+	nvme_start_queues(ctrl);
+	nvme_tcp_teardown_admin_queue(ctrl, false);
+
+	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_CONNECTING)) {
+		/* state change failure is ok if we're in DELETING state */
+		WARN_ON_ONCE(ctrl->state != NVME_CTRL_DELETING);
+		return;
+	}
+
+	nvme_tcp_reconnect_or_remove(ctrl);
+}
+
+static void nvme_tcp_teardown_ctrl(struct nvme_ctrl *ctrl, bool shutdown)
+{
+	nvme_tcp_teardown_io_queues(ctrl, shutdown);
+	if (shutdown)
+		nvme_shutdown_ctrl(ctrl);
+	else
+		nvme_disable_ctrl(ctrl, ctrl->cap);
+	nvme_tcp_teardown_admin_queue(ctrl, shutdown);
+}
+
+static void nvme_tcp_delete_ctrl(struct nvme_ctrl *ctrl)
+{
+	nvme_tcp_teardown_ctrl(ctrl, true);
+}
+
+static void nvme_reset_ctrl_work(struct work_struct *work)
+{
+	struct nvme_ctrl *ctrl =
+		container_of(work, struct nvme_ctrl, reset_work);
+
+	nvme_stop_ctrl(ctrl);
+	nvme_tcp_teardown_ctrl(ctrl, false);
+
+	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_CONNECTING)) {
+		/* state change failure is ok if we're in DELETING state */
+		WARN_ON_ONCE(ctrl->state != NVME_CTRL_DELETING);
+		return;
+	}
+
+	if (nvme_tcp_setup_ctrl(ctrl, false))
+		goto out_fail;
+
+	return;
+
+out_fail:
+	++ctrl->nr_reconnects;
+	nvme_tcp_reconnect_or_remove(ctrl);
+}
+
+static void nvme_tcp_stop_ctrl(struct nvme_ctrl *ctrl)
+{
+	cancel_work_sync(&ctrl->err_work);
+	cancel_delayed_work_sync(&ctrl->connect_work);
+}
+
+static void nvme_tcp_free_ctrl(struct nvme_ctrl *nctrl)
+{
+	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+
+	if (list_empty(&ctrl->list))
+		goto free_ctrl;
+
+	mutex_lock(&nvme_tcp_ctrl_mutex);
+	list_del(&ctrl->list);
+	mutex_unlock(&nvme_tcp_ctrl_mutex);
+
+	nvmf_free_options(nctrl->opts);
+free_ctrl:
+	kfree(ctrl->queues);
+	kfree(ctrl);
+}
+
+static void nvme_tcp_set_sg_null(struct nvme_command *c)
+{
+	struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
+
+	sg->addr = 0;
+	sg->length = 0;
+	sg->type = (NVME_TRANSPORT_SGL_DATA_DESC << 4) |
+			NVME_SGL_FMT_TRANSPORT_A;
+}
+
+static void nvme_tcp_set_sg_inline(struct nvme_tcp_queue *queue,
+		struct nvme_tcp_request *req, struct nvme_command *c)
+{
+	struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
+
+	sg->addr = cpu_to_le64(queue->ctrl->ctrl.icdoff);
+	sg->length = cpu_to_le32(req->data_len);
+	sg->type = (NVME_SGL_FMT_DATA_DESC << 4) | NVME_SGL_FMT_OFFSET;
+}
+
+static void nvme_tcp_set_sg_host_data(struct nvme_tcp_request *req,
+		struct nvme_command *c)
+{
+	struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
+
+	sg->addr = 0;
+	sg->length = cpu_to_le32(req->data_len);
+	sg->type = (NVME_TRANSPORT_SGL_DATA_DESC << 4) |
+			NVME_SGL_FMT_TRANSPORT_A;
+}
+
+static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
+{
+	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(arg);
+	struct nvme_tcp_queue *queue = &ctrl->queues[0];
+	struct nvme_tcp_cmd_pdu *pdu = ctrl->async_req.pdu;
+	struct nvme_command *cmd = &pdu->cmd;
+	u8 hdgst = nvme_tcp_hdgst_len(queue);
+
+	memset(pdu, 0, sizeof(*pdu));
+	pdu->hdr.type = nvme_tcp_cmd;
+	if (queue->hdr_digest)
+		pdu->hdr.flags |= NVME_TCP_F_HDGST;
+	pdu->hdr.hlen = sizeof(*pdu);
+	pdu->hdr.plen = cpu_to_le32(pdu->hdr.hlen + hdgst);
+
+	cmd->common.opcode = nvme_admin_async_event;
+	cmd->common.command_id = NVME_AQ_BLK_MQ_DEPTH;
+	cmd->common.flags |= NVME_CMD_SGL_METABUF;
+	nvme_tcp_set_sg_null(cmd);
+
+	ctrl->async_req.snd.state = NVME_TCP_SEND_CMD_PDU;
+	ctrl->async_req.snd.offset = 0;
+	ctrl->async_req.snd.curr_bio = NULL;
+	ctrl->async_req.rcv.curr_bio = NULL;
+	ctrl->async_req.data_len = 0;
+
+	nvme_tcp_queue_request(&ctrl->async_req);
+}
+
+static enum blk_eh_timer_return
+nvme_tcp_timeout(struct request *rq, bool reserved)
+{
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	struct nvme_tcp_ctrl *ctrl = req->queue->ctrl;
+	struct nvme_tcp_cmd_pdu *pdu = req->pdu;
+
+	dev_dbg(ctrl->ctrl.device,
+		"queue %d: timeout request %#x type %d\n",
+		nvme_tcp_queue_id(req->queue), rq->tag,
+		pdu->hdr.type);
+
+	if (ctrl->ctrl.state != NVME_CTRL_LIVE) {
+		union nvme_result res = {};
+
+		nvme_req(rq)->flags |= NVME_REQ_CANCELLED;
+		nvme_end_request(rq, NVME_SC_ABORT_REQ, res);
+		return BLK_EH_DONE;
+	}
+
+	/* queue error recovery */
+	nvme_tcp_error_recovery(&ctrl->ctrl);
+
+	return BLK_EH_RESET_TIMER;
+}
+
+static blk_status_t nvme_tcp_map_data(struct nvme_tcp_queue *queue,
+			struct request *rq)
+{
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	struct nvme_tcp_cmd_pdu *pdu = req->pdu;
+	struct nvme_command *c = &pdu->cmd;
+
+	c->common.flags |= NVME_CMD_SGL_METABUF;
+
+	if (!req->data_len) {
+		nvme_tcp_set_sg_null(c);
+		return 0;
+	}
+
+	if (rq_data_dir(rq) == WRITE &&
+	    req->data_len <= nvme_tcp_inline_data_size(queue))
+		nvme_tcp_set_sg_inline(queue, req, c);
+	else
+		nvme_tcp_set_sg_host_data(req, c);
+
+	return 0;
+}
+
+static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
+		struct request *rq)
+{
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	struct nvme_tcp_cmd_pdu *pdu = req->pdu;
+	struct nvme_tcp_queue *queue = req->queue;
+	u8 hdgst = nvme_tcp_hdgst_len(queue), ddgst = 0;
+	blk_status_t ret;
+
+	ret = nvme_setup_cmd(ns, rq, &pdu->cmd);
+	if (ret)
+		return ret;
+
+	req->snd.state = NVME_TCP_SEND_CMD_PDU;
+	req->snd.offset = 0;
+	req->snd.data_sent = 0;
+	req->pdu_len = 0;
+	req->pdu_sent = 0;
+	req->data_len = blk_rq_payload_bytes(rq);
+
+	if (rq_data_dir(rq) == WRITE) {
+		req->snd.curr_bio = rq->bio;
+		if (req->data_len <= nvme_tcp_inline_data_size(queue))
+			req->pdu_len = req->data_len;
+	} else {
+		req->rcv.curr_bio = rq->bio;
+		if (req->rcv.curr_bio)
+			nvme_tcp_init_recv_iter(req);
+	}
+
+	pdu->hdr.type = nvme_tcp_cmd;
+	pdu->hdr.flags = 0;
+	if (queue->hdr_digest)
+		pdu->hdr.flags |= NVME_TCP_F_HDGST;
+	if (queue->data_digest && req->pdu_len) {
+		pdu->hdr.flags |= NVME_TCP_F_DDGST;
+		ddgst = nvme_tcp_ddgst_len(queue);
+	}
+	pdu->hdr.hlen = sizeof(*pdu);
+	pdu->hdr.pdo = req->pdu_len ? pdu->hdr.hlen + hdgst : 0;
+	pdu->hdr.plen =
+		cpu_to_le32(pdu->hdr.hlen + hdgst + req->pdu_len + ddgst);
+
+	ret = nvme_tcp_map_data(queue, rq);
+	if (unlikely(ret)) {
+		dev_err(queue->ctrl->ctrl.device,
+			"Failed to map data (%d)\n", ret);
+		return ret;
+	}
+
+	return 0;
+}
+
+static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
+		const struct blk_mq_queue_data *bd)
+{
+	struct nvme_ns *ns = hctx->queue->queuedata;
+	struct nvme_tcp_queue *queue = hctx->driver_data;
+	struct request *rq = bd->rq;
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	bool queue_ready = test_bit(NVME_TCP_Q_LIVE, &queue->flags);
+	blk_status_t ret;
+
+	if (!nvmf_check_ready(&queue->ctrl->ctrl, rq, queue_ready))
+		return nvmf_fail_nonready_command(&queue->ctrl->ctrl, rq);
+
+	ret = nvme_tcp_setup_cmd_pdu(ns, rq);
+	if (unlikely(ret))
+		return ret;
+
+	blk_mq_start_request(rq);
+
+	nvme_tcp_queue_request(req);
+
+	return BLK_STS_OK;
+}
+
+static struct blk_mq_ops nvme_tcp_mq_ops = {
+	.queue_rq	= nvme_tcp_queue_rq,
+	.complete	= nvme_complete_rq,
+	.init_request	= nvme_tcp_init_request,
+	.exit_request	= nvme_tcp_exit_request,
+	.init_hctx	= nvme_tcp_init_hctx,
+	.timeout	= nvme_tcp_timeout,
+};
+
+static struct blk_mq_ops nvme_tcp_admin_mq_ops = {
+	.queue_rq	= nvme_tcp_queue_rq,
+	.complete	= nvme_complete_rq,
+	.init_request	= nvme_tcp_init_request,
+	.exit_request	= nvme_tcp_exit_request,
+	.init_hctx	= nvme_tcp_init_admin_hctx,
+	.timeout	= nvme_tcp_timeout,
+};
+
+static const struct nvme_ctrl_ops nvme_tcp_ctrl_ops = {
+	.name			= "tcp",
+	.module			= THIS_MODULE,
+	.flags			= NVME_F_FABRICS,
+	.reg_read32		= nvmf_reg_read32,
+	.reg_read64		= nvmf_reg_read64,
+	.reg_write32		= nvmf_reg_write32,
+	.free_ctrl		= nvme_tcp_free_ctrl,
+	.submit_async_event	= nvme_tcp_submit_async_event,
+	.delete_ctrl		= nvme_tcp_delete_ctrl,
+	.get_address		= nvmf_get_address,
+	.stop_ctrl		= nvme_tcp_stop_ctrl,
+};
+
+static bool
+nvme_tcp_existing_controller(struct nvmf_ctrl_options *opts)
+{
+	struct nvme_tcp_ctrl *ctrl;
+	bool found = false;
+
+	mutex_lock(&nvme_tcp_ctrl_mutex);
+	list_for_each_entry(ctrl, &nvme_tcp_ctrl_list, list) {
+		found = nvmf_ip_options_match(&ctrl->ctrl, opts);
+		if (found)
+			break;
+	}
+	mutex_unlock(&nvme_tcp_ctrl_mutex);
+
+	return found;
+}
+
+static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
+		struct nvmf_ctrl_options *opts)
+{
+	struct nvme_tcp_ctrl *ctrl;
+	int ret;
+
+	ctrl = kzalloc(sizeof(*ctrl), GFP_KERNEL);
+	if (!ctrl)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&ctrl->list);
+	ctrl->ctrl.opts = opts;
+	ctrl->ctrl.queue_count = opts->nr_io_queues + 1; /* +1 for admin queue */
+	ctrl->ctrl.sqsize = opts->queue_size - 1;
+	ctrl->ctrl.kato = opts->kato;
+
+	INIT_DELAYED_WORK(&ctrl->ctrl.connect_work,
+			nvme_tcp_reconnect_ctrl_work);
+	INIT_WORK(&ctrl->ctrl.err_work, nvme_tcp_error_recovery_work);
+	INIT_WORK(&ctrl->ctrl.reset_work, nvme_reset_ctrl_work);
+
+	if (!(opts->mask & NVMF_OPT_TRSVCID)) {
+		opts->trsvcid =
+			kstrdup(__stringify(NVME_TCP_DISC_PORT), GFP_KERNEL);
+		if (!opts->trsvcid) {
+			ret = -ENOMEM;
+			goto out_free_ctrl;
+		}
+		opts->mask |= NVMF_OPT_TRSVCID;
+	}
+
+	ret = inet_pton_with_scope(&init_net, AF_UNSPEC,
+			opts->traddr, opts->trsvcid, &ctrl->addr);
+	if (ret) {
+		pr_err("malformed address passed: %s:%s\n",
+			opts->traddr, opts->trsvcid);
+		goto out_free_ctrl;
+	}
+
+	if (opts->mask & NVMF_OPT_HOST_TRADDR) {
+		ret = inet_pton_with_scope(&init_net, AF_UNSPEC,
+			opts->host_traddr, NULL, &ctrl->src_addr);
+		if (ret) {
+			pr_err("malformed src address passed: %s\n",
+			       opts->host_traddr);
+			goto out_free_ctrl;
+		}
+	}
+
+	if (!opts->duplicate_connect && nvme_tcp_existing_controller(opts)) {
+		ret = -EALREADY;
+		goto out_free_ctrl;
+	}
+
+	ctrl->queues = kcalloc(opts->nr_io_queues + 1, sizeof(*ctrl->queues),
+				GFP_KERNEL);
+	if (!ctrl->queues) {
+		ret = -ENOMEM;
+		goto out_free_ctrl;
+	}
+
+	ret = nvme_init_ctrl(&ctrl->ctrl, dev, &nvme_tcp_ctrl_ops, 0);
+	if (ret)
+		goto out_kfree_queues;
+
+	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
+		WARN_ON_ONCE(1);
+		ret = -EINTR;
+		goto out_uninit_ctrl;
+	}
+
+	ret = nvme_tcp_setup_ctrl(&ctrl->ctrl, true);
+	if (ret)
+		goto out_uninit_ctrl;
+
+	dev_info(ctrl->ctrl.device, "new ctrl: NQN \"%s\", addr %pISp\n",
+		ctrl->ctrl.opts->subsysnqn, &ctrl->addr);
+
+	nvme_get_ctrl(&ctrl->ctrl);
+
+	mutex_lock(&nvme_tcp_ctrl_mutex);
+	list_add_tail(&ctrl->list, &nvme_tcp_ctrl_list);
+	mutex_unlock(&nvme_tcp_ctrl_mutex);
+
+	return &ctrl->ctrl;
+
+out_uninit_ctrl:
+	nvme_uninit_ctrl(&ctrl->ctrl);
+	nvme_put_ctrl(&ctrl->ctrl);
+	if (ret > 0)
+		ret = -EIO;
+	return ERR_PTR(ret);
+out_kfree_queues:
+	kfree(ctrl->queues);
+out_free_ctrl:
+	kfree(ctrl);
+	return ERR_PTR(ret);
+}
+
+static struct nvmf_transport_ops nvme_tcp_transport = {
+	.name		= "tcp",
+	.module		= THIS_MODULE,
+	.required_opts	= NVMF_OPT_TRADDR,
+	.allowed_opts	= NVMF_OPT_TRSVCID | NVMF_OPT_RECONNECT_DELAY |
+			  NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO |
+			  NVMF_OPT_HDR_DIGEST | NVMF_OPT_DATA_DIGEST,
+	.create_ctrl	= nvme_tcp_create_ctrl,
+};
+
+static int __init nvme_tcp_init_module(void)
+{
+	nvme_tcp_wq = alloc_workqueue("nvme_tcp_wq",
+			WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
+	if (!nvme_tcp_wq)
+		return -ENOMEM;
+
+	nvmf_register_transport(&nvme_tcp_transport);
+	return 0;
+}
+
+static void __exit nvme_tcp_cleanup_module(void)
+{
+	struct nvme_tcp_ctrl *ctrl;
+
+	nvmf_unregister_transport(&nvme_tcp_transport);
+
+	mutex_lock(&nvme_tcp_ctrl_mutex);
+	list_for_each_entry(ctrl, &nvme_tcp_ctrl_list, list)
+		nvme_delete_ctrl(&ctrl->ctrl);
+	mutex_unlock(&nvme_tcp_ctrl_mutex);
+	flush_workqueue(nvme_delete_wq);
+
+	destroy_workqueue(nvme_tcp_wq);
+}
+
+module_init(nvme_tcp_init_module);
+module_exit(nvme_tcp_cleanup_module);
+
+MODULE_LICENSE("GPL v2");
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v2 14/14] nvme-tcp: add NVMe over TCP host driver
@ 2018-11-20  3:00   ` Sagi Grimberg
  0 siblings, 0 replies; 76+ messages in thread
From: Sagi Grimberg @ 2018-11-20  3:00 UTC (permalink / raw)


From: Sagi Grimberg <sagi@lightbitslabs.com>

This patch implements the NVMe over TCP host driver. It can be used to
connect to remote NVMe over Fabrics subsystems over good old TCP/IP.

The driver implements TP 8000, which defines how nvme over fabrics capsules
and data are encapsulated in nvme-tcp pdus and exchanged on top of a TCP
byte stream. nvme-tcp header and data digest are supported as well.
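
For reference, every NVMe/TCP PDU starts with a common header. A rough
sketch of it, inferred from how this driver uses the fields (the
authoritative definition is the nvme-tcp protocol header added earlier
in this series), looks like this:

	struct nvme_tcp_hdr {
		__u8	type;	/* pdu type (nvme_tcp_cmd, nvme_tcp_rsp, ...) */
		__u8	flags;	/* NVME_TCP_F_HDGST, NVME_TCP_F_DDGST, ... */
		__u8	hlen;	/* pdu header length */
		__u8	pdo;	/* pdu data offset */
		__le32	plen;	/* total pdu length: header + digests + data */
	};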

To connect to all NVMe over Fabrics controllers reachable on a given target
port over TCP, use the following command:

	nvme connect-all -t tcp -a $IPADDR

This requires the latest version of nvme-cli with TCP support.
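
To connect to a single subsystem with header and data digests enabled,
something along these lines should work (NQN, address, port and the
digest option names are illustrative only and depend on the nvme-cli
patches at the tail of this series):

	nvme connect -t tcp -a $IPADDR -s 4420 \
		-n nqn.2018-11.io.example:subsys1 --hdr-digest --data-digest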

Signed-off-by: Sagi Grimberg <sagi at lightbitslabs.com>
Signed-off-by: Roy Shterman <roys at lightbitslabs.com>
Signed-off-by: Solganik Alexander <sashas at lightbitslabs.com>
---
 drivers/nvme/host/Kconfig  |   15 +
 drivers/nvme/host/Makefile |    3 +
 drivers/nvme/host/tcp.c    | 2306 ++++++++++++++++++++++++++++++++++++
 3 files changed, 2324 insertions(+)
 create mode 100644 drivers/nvme/host/tcp.c

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index 88a8b5916624..0f345e207675 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -57,3 +57,18 @@ config NVME_FC
 	  from https://github.com/linux-nvme/nvme-cli.
 
 	  If unsure, say N.
+
+config NVME_TCP
+	tristate "NVM Express over Fabrics TCP host driver"
+	depends on INET
+	depends on BLK_DEV_NVME
+	select NVME_FABRICS
+	help
+	  This provides support for the NVMe over Fabrics protocol using
+	  the TCP transport.  This allows you to use remote block devices
+	  exported using the NVMe protocol set.
+
+	  To configure a NVMe over Fabrics controller use the nvme-cli tool
+	  from https://github.com/linux-nvme/nvme-cli.
+
+	  If unsure, say N.
diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
index aea459c65ae1..8a4b671c5f0c 100644
--- a/drivers/nvme/host/Makefile
+++ b/drivers/nvme/host/Makefile
@@ -7,6 +7,7 @@ obj-$(CONFIG_BLK_DEV_NVME)		+= nvme.o
 obj-$(CONFIG_NVME_FABRICS)		+= nvme-fabrics.o
 obj-$(CONFIG_NVME_RDMA)			+= nvme-rdma.o
 obj-$(CONFIG_NVME_FC)			+= nvme-fc.o
+obj-$(CONFIG_NVME_TCP)			+= nvme-tcp.o
 
 nvme-core-y				:= core.o
 nvme-core-$(CONFIG_TRACING)		+= trace.o
@@ -21,3 +22,5 @@ nvme-fabrics-y				+= fabrics.o
 nvme-rdma-y				+= rdma.o
 
 nvme-fc-y				+= fc.o
+
+nvme-tcp-y				+= tcp.o
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
new file mode 100644
index 000000000000..4c583859f0ad
--- /dev/null
+++ b/drivers/nvme/host/tcp.c
@@ -0,0 +1,2306 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * NVMe over Fabrics TCP host.
+ * Copyright (c) 2018 LightBits Labs. All rights reserved.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/nvme-tcp.h>
+#include <net/sock.h>
+#include <net/tcp.h>
+#include <linux/blk-mq.h>
+#include <crypto/hash.h>
+
+#include "nvme.h"
+#include "fabrics.h"
+
+struct nvme_tcp_queue;
+
+enum nvme_tcp_send_state {
+	NVME_TCP_SEND_CMD_PDU = 0,
+	NVME_TCP_SEND_H2C_PDU,
+	NVME_TCP_SEND_DATA,
+	NVME_TCP_SEND_DDGST,
+};
+
+struct nvme_tcp_send_ctx {
+	struct bio		*curr_bio;
+	struct iov_iter		iter;
+	size_t			offset;
+	size_t			data_sent;
+	enum nvme_tcp_send_state state;
+};
+
+struct nvme_tcp_recv_ctx {
+	struct iov_iter		iter;
+	struct bio		*curr_bio;
+};
+
+struct nvme_tcp_request {
+	struct nvme_request	req;
+	void			*pdu;
+	struct nvme_tcp_queue	*queue;
+	u32			data_len;
+	u32			pdu_len;
+	u32			pdu_sent;
+	u16			ttag;
+	struct list_head	entry;
+	struct nvme_tcp_recv_ctx rcv;
+	struct nvme_tcp_send_ctx snd;
+	u32			ddgst;
+};
+
+enum nvme_tcp_queue_flags {
+	NVME_TCP_Q_ALLOCATED	= 0,
+	NVME_TCP_Q_LIVE		= 1,
+};
+
+enum nvme_tcp_recv_state {
+	NVME_TCP_RECV_PDU = 0,
+	NVME_TCP_RECV_DATA,
+	NVME_TCP_RECV_DDGST,
+};
+
+struct nvme_tcp_queue_recv_ctx {
+	char		*pdu;
+	int		pdu_remaining;
+	int		pdu_offset;
+	size_t		data_remaining;
+	size_t		ddgst_remaining;
+};
+
+struct nvme_tcp_ctrl;
+struct nvme_tcp_queue {
+	struct socket		*sock;
+	struct work_struct	io_work;
+	int			io_cpu;
+
+	spinlock_t		lock;
+	struct list_head	send_list;
+
+	int			queue_size;
+	size_t			cmnd_capsule_len;
+	struct nvme_tcp_ctrl	*ctrl;
+	unsigned long		flags;
+	bool			rd_enabled;
+
+	struct nvme_tcp_queue_recv_ctx rcv;
+	struct nvme_tcp_request *request;
+
+	bool			hdr_digest;
+	bool			data_digest;
+	struct ahash_request	*rcv_hash;
+	struct ahash_request	*snd_hash;
+	__le32			exp_ddgst;
+	__le32			recv_ddgst;
+
+	struct page_frag_cache	pf_cache;
+
+	void (*sc)(struct sock *);
+	void (*dr)(struct sock *);
+	void (*ws)(struct sock *);
+};
+
+struct nvme_tcp_ctrl {
+	/* read only in the hot path */
+	struct nvme_tcp_queue	*queues;
+	struct blk_mq_tag_set	tag_set;
+
+	/* other member variables */
+	struct list_head	list;
+	struct blk_mq_tag_set	admin_tag_set;
+	struct sockaddr_storage addr;
+	struct sockaddr_storage src_addr;
+	struct nvme_ctrl	ctrl;
+
+	struct nvme_tcp_request async_req;
+};
+
+static LIST_HEAD(nvme_tcp_ctrl_list);
+static DEFINE_MUTEX(nvme_tcp_ctrl_mutex);
+static struct workqueue_struct *nvme_tcp_wq;
+static struct blk_mq_ops nvme_tcp_mq_ops;
+static struct blk_mq_ops nvme_tcp_admin_mq_ops;
+
+static inline struct nvme_tcp_ctrl *to_tcp_ctrl(struct nvme_ctrl *ctrl)
+{
+	return container_of(ctrl, struct nvme_tcp_ctrl, ctrl);
+}
+
+static inline int nvme_tcp_queue_id(struct nvme_tcp_queue *queue)
+{
+	return queue - queue->ctrl->queues;
+}
+
+static inline struct blk_mq_tags *nvme_tcp_tagset(struct nvme_tcp_queue *queue)
+{
+	u32 queue_idx = nvme_tcp_queue_id(queue);
+
+	if (queue_idx == 0)
+		return queue->ctrl->admin_tag_set.tags[queue_idx];
+	return queue->ctrl->tag_set.tags[queue_idx - 1];
+}
+
+static inline u8 nvme_tcp_hdgst_len(struct nvme_tcp_queue *queue)
+{
+	return queue->hdr_digest ? NVME_TCP_DIGEST_LENGTH : 0;
+}
+
+static inline u8 nvme_tcp_ddgst_len(struct nvme_tcp_queue *queue)
+{
+	return queue->data_digest ? NVME_TCP_DIGEST_LENGTH : 0;
+}
+
+static inline size_t nvme_tcp_inline_data_size(struct nvme_tcp_queue *queue)
+{
+	return queue->cmnd_capsule_len - sizeof(struct nvme_command);
+}
+
+static inline bool nvme_tcp_async_req(struct nvme_tcp_request *req)
+{
+	return req == &req->queue->ctrl->async_req;
+}
+
+static inline bool nvme_tcp_has_inline_data(struct nvme_tcp_request *req)
+{
+	struct request *rq;
+	unsigned int bytes;
+
+	if (unlikely(nvme_tcp_async_req(req)))
+		return false; /* async events don't have a request */
+
+	rq = blk_mq_rq_from_pdu(req);
+	bytes = blk_rq_payload_bytes(rq);
+
+	return rq_data_dir(rq) == WRITE && bytes &&
+		bytes <= nvme_tcp_inline_data_size(req->queue);
+}
+
+static inline struct page *nvme_tcp_req_cur_page(struct nvme_tcp_request *req)
+{
+	return req->snd.iter.bvec->bv_page;
+}
+
+static inline size_t nvme_tcp_req_cur_offset(struct nvme_tcp_request *req)
+{
+	return req->snd.iter.bvec->bv_offset + req->snd.iter.iov_offset;
+}
+
+static inline size_t nvme_tcp_req_cur_length(struct nvme_tcp_request *req)
+{
+	return min_t(size_t, req->snd.iter.bvec->bv_len - req->snd.iter.iov_offset,
+			req->pdu_len - req->pdu_sent);
+}
+
+static inline size_t nvme_tcp_req_offset(struct nvme_tcp_request *req)
+{
+	return req->snd.iter.iov_offset;
+}
+
+static inline size_t nvme_tcp_pdu_data_left(struct nvme_tcp_request *req)
+{
+	return rq_data_dir(blk_mq_rq_from_pdu(req)) == WRITE ?
+			req->pdu_len - req->pdu_sent : 0;
+}
+
+static inline size_t nvme_tcp_pdu_last_send(struct nvme_tcp_request *req,
+		int len)
+{
+	return nvme_tcp_pdu_data_left(req) <= len;
+}
+
+static void nvme_tcp_init_send_iter(struct nvme_tcp_request *req)
+{
+	struct request *rq = blk_mq_rq_from_pdu(req);
+	struct bio_vec *vec;
+	unsigned int size;
+	int nsegs;
+	size_t offset;
+
+	if (rq->rq_flags & RQF_SPECIAL_PAYLOAD) {
+		vec = &rq->special_vec;
+		nsegs = 1;
+		size = blk_rq_payload_bytes(rq);
+		offset = 0;
+	} else {
+		struct bio *bio = req->snd.curr_bio;
+
+		vec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+		nsegs = bio_segments(bio);
+		size = bio->bi_iter.bi_size;
+		offset = bio->bi_iter.bi_bvec_done;
+	}
+
+	iov_iter_bvec(&req->snd.iter, WRITE, vec, nsegs, size);
+	req->snd.iter.iov_offset = offset;
+}
+
+static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
+		int len)
+{
+	req->snd.data_sent += len;
+	req->pdu_sent += len;
+	iov_iter_advance(&req->snd.iter, len);
+	if (!iov_iter_count(&req->snd.iter) &&
+	    req->snd.data_sent < req->data_len) {
+		req->snd.curr_bio = req->snd.curr_bio->bi_next;
+		nvme_tcp_init_send_iter(req);
+	}
+}
+
+static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req)
+{
+	struct nvme_tcp_queue *queue = req->queue;
+
+	spin_lock_bh(&queue->lock);
+	list_add_tail(&req->entry, &queue->send_list);
+	spin_unlock_bh(&queue->lock);
+
+	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+}
+
+static inline struct nvme_tcp_request *
+nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
+{
+	struct nvme_tcp_request *req;
+
+	spin_lock_bh(&queue->lock);
+	req = list_first_entry_or_null(&queue->send_list,
+			struct nvme_tcp_request, entry);
+	if (req)
+		list_del(&req->entry);
+	spin_unlock_bh(&queue->lock);
+
+	return req;
+}
+
+static inline void nvme_tcp_ddgst_final(struct ahash_request *hash, u32 *dgst)
+{
+	ahash_request_set_crypt(hash, NULL, (u8 *)dgst, 0);
+	crypto_ahash_final(hash);
+}
+
+static inline void nvme_tcp_ddgst_update(struct ahash_request *hash,
+		struct page *page, off_t off, size_t len)
+{
+	struct scatterlist sg;
+
+	sg_init_marker(&sg, 1);
+	sg_set_page(&sg, page, len, off);
+	ahash_request_set_crypt(hash, &sg, NULL, len);
+	crypto_ahash_update(hash);
+}
+
+static inline void nvme_tcp_hdgst(struct ahash_request *hash,
+		void *pdu, size_t len)
+{
+	struct scatterlist sg;
+
+	sg_init_one(&sg, pdu, len);
+	ahash_request_set_crypt(hash, &sg, pdu + len, len);
+	crypto_ahash_digest(hash);
+}
+
+static int nvme_tcp_verify_hdgst(struct nvme_tcp_queue *queue,
+	void *pdu, size_t pdu_len)
+{
+	struct nvme_tcp_hdr *hdr = pdu;
+	__le32 recv_digest;
+	__le32 exp_digest;
+
+	if (unlikely(!(hdr->flags & NVME_TCP_F_HDGST))) {
+		dev_err(queue->ctrl->ctrl.device,
+			"queue %d: header digest flag is cleared\n",
+			nvme_tcp_queue_id(queue));
+		return -EPROTO;
+	}
+
+	recv_digest = *(__le32 *)(pdu + hdr->hlen);
+	nvme_tcp_hdgst(queue->rcv_hash, pdu, pdu_len);
+	exp_digest = *(__le32 *)(pdu + hdr->hlen);
+	if (recv_digest != exp_digest) {
+		dev_err(queue->ctrl->ctrl.device,
+			"header digest error: recv %#x expected %#x\n",
+			le32_to_cpu(recv_digest), le32_to_cpu(exp_digest));
+		return -EIO;
+	}
+
+	return 0;
+}
+
+static int nvme_tcp_check_ddgst(struct nvme_tcp_queue *queue, void *pdu)
+{
+	struct nvme_tcp_hdr *hdr = pdu;
+	u32 len;
+
+	len = le32_to_cpu(hdr->plen) - hdr->hlen -
+		((hdr->flags & NVME_TCP_F_HDGST) ? nvme_tcp_hdgst_len(queue) : 0);
+
+	if (unlikely(len && !(hdr->flags & NVME_TCP_F_DDGST))) {
+		dev_err(queue->ctrl->ctrl.device,
+			"queue %d: data digest flag is cleared\n",
+		nvme_tcp_queue_id(queue));
+		return -EPROTO;
+	}
+	crypto_ahash_init(queue->rcv_hash);
+
+	return 0;
+}
+
+static void nvme_tcp_exit_request(struct blk_mq_tag_set *set,
+		struct request *rq, unsigned int hctx_idx)
+{
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+
+	page_frag_free(req->pdu);
+}
+
+static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
+		struct request *rq, unsigned int hctx_idx,
+		unsigned int numa_node)
+{
+	struct nvme_tcp_ctrl *ctrl = set->driver_data;
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	int queue_idx = (set == &ctrl->tag_set) ? hctx_idx + 1 : 0;
+	struct nvme_tcp_queue *queue = &ctrl->queues[queue_idx];
+	u8 hdgst = nvme_tcp_hdgst_len(queue);
+
+	req->pdu = page_frag_alloc(&queue->pf_cache,
+		sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
+		GFP_KERNEL | __GFP_ZERO);
+	if (!req->pdu)
+		return -ENOMEM;
+
+	req->queue = queue;
+	nvme_req(rq)->ctrl = &ctrl->ctrl;
+
+	return 0;
+}
+
+static int nvme_tcp_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+		unsigned int hctx_idx)
+{
+	struct nvme_tcp_ctrl *ctrl = data;
+	struct nvme_tcp_queue *queue = &ctrl->queues[hctx_idx + 1];
+
+	BUG_ON(hctx_idx >= ctrl->ctrl.queue_count);
+
+	hctx->driver_data = queue;
+	return 0;
+}
+
+static int nvme_tcp_init_admin_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+		unsigned int hctx_idx)
+{
+	struct nvme_tcp_ctrl *ctrl = data;
+	struct nvme_tcp_queue *queue = &ctrl->queues[0];
+
+	BUG_ON(hctx_idx != 0);
+
+	hctx->driver_data = queue;
+	return 0;
+}
+
+static enum nvme_tcp_recv_state nvme_tcp_recv_state(struct nvme_tcp_queue *queue)
+{
+	return  (queue->rcv.pdu_remaining) ? NVME_TCP_RECV_PDU :
+		(queue->rcv.ddgst_remaining) ? NVME_TCP_RECV_DDGST :
+		NVME_TCP_RECV_DATA;
+}
+
+static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
+{
+	struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+
+	rcv->pdu_remaining = sizeof(struct nvme_tcp_rsp_pdu) +
+				nvme_tcp_hdgst_len(queue);
+	rcv->pdu_offset = 0;
+	rcv->data_remaining = -1;
+	rcv->ddgst_remaining = 0;
+}
+
+void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
+{
+	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+		return;
+
+	queue_work(nvme_wq, &ctrl->err_work);
+}
+
+static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
+		struct nvme_completion *cqe)
+{
+	struct request *rq;
+	struct nvme_tcp_request *req;
+
+	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
+	if (!rq) {
+		dev_err(queue->ctrl->ctrl.device,
+			"queue %d tag 0x%x not found\n",
+			nvme_tcp_queue_id(queue), cqe->command_id);
+		nvme_tcp_error_recovery(&queue->ctrl->ctrl);
+		return -EINVAL;
+	}
+	req = blk_mq_rq_to_pdu(rq);
+
+	nvme_end_request(rq, cqe->status, cqe->result);
+
+	return 0;
+}
+
+static int nvme_tcp_handle_c2h_data(struct nvme_tcp_queue *queue,
+		struct nvme_tcp_data_pdu *pdu)
+{
+	struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+	struct nvme_tcp_request *req;
+	struct request *rq;
+
+	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
+	if (!rq) {
+		dev_err(queue->ctrl->ctrl.device,
+			"queue %d tag %#x not found\n",
+			nvme_tcp_queue_id(queue), pdu->command_id);
+		return -ENOENT;
+	}
+	req = blk_mq_rq_to_pdu(rq);
+
+	if (!blk_rq_payload_bytes(rq)) {
+		dev_err(queue->ctrl->ctrl.device,
+			"queue %d tag %#x unexpected data\n",
+			nvme_tcp_queue_id(queue), rq->tag);
+		return -EIO;
+	}
+
+	rcv->data_remaining = le32_to_cpu(pdu->data_length);
+	/* No support for out-of-order */
+	WARN_ON(le32_to_cpu(pdu->data_offset));
+
+	return 0;
+}
+
+static int nvme_tcp_handle_comp(struct nvme_tcp_queue *queue,
+		struct nvme_tcp_rsp_pdu *pdu)
+{
+	struct nvme_completion *cqe = &pdu->cqe;
+	int ret = 0;
+
+	/*
+	 * AEN requests are special as they don't time out and can
+	 * survive any kind of queue freeze and often don't respond to
+	 * aborts.  We don't even bother to allocate a struct request
+	 * for them but rather special case them here.
+	 */
+	if (unlikely(nvme_tcp_queue_id(queue) == 0 &&
+	    cqe->command_id >= NVME_AQ_BLK_MQ_DEPTH))
+		nvme_complete_async_event(&queue->ctrl->ctrl, cqe->status,
+				&cqe->result);
+	else
+		ret = nvme_tcp_process_nvme_cqe(queue, cqe);
+
+	return ret;
+}
+
+static int nvme_tcp_setup_h2c_data_pdu(struct nvme_tcp_request *req,
+		struct nvme_tcp_r2t_pdu *pdu)
+{
+	struct nvme_tcp_data_pdu *data = req->pdu;
+	struct nvme_tcp_queue *queue = req->queue;
+	struct request *rq = blk_mq_rq_from_pdu(req);
+	u8 hdgst = nvme_tcp_hdgst_len(queue);
+	u8 ddgst = nvme_tcp_ddgst_len(queue);
+
+	req->pdu_len = le32_to_cpu(pdu->r2t_length);
+	req->pdu_sent = 0;
+
+	if (unlikely(req->snd.data_sent + req->pdu_len > req->data_len)) {
+		dev_err(queue->ctrl->ctrl.device,
+			"req %d r2t length %u exceeded data length %u (%zu sent)\n",
+			rq->tag, req->pdu_len, req->data_len,
+			req->snd.data_sent);
+		return -EPROTO;
+	}
+
+	if (unlikely(le32_to_cpu(pdu->r2t_offset) < req->snd.data_sent)) {
+		dev_err(queue->ctrl->ctrl.device,
+			"req %d unexpected r2t offset %u (expected %zu)\n",
+			rq->tag, le32_to_cpu(pdu->r2t_offset),
+			req->snd.data_sent);
+		return -EPROTO;
+	}
+
+	memset(data, 0, sizeof(*data));
+	data->hdr.type = nvme_tcp_h2c_data;
+	data->hdr.flags = NVME_TCP_F_DATA_LAST;
+	if (queue->hdr_digest)
+		data->hdr.flags |= NVME_TCP_F_HDGST;
+	if (queue->data_digest)
+		data->hdr.flags |= NVME_TCP_F_DDGST;
+	data->hdr.hlen = sizeof(*data);
+	data->hdr.pdo = data->hdr.hlen + hdgst;
+	data->hdr.plen =
+		cpu_to_le32(data->hdr.hlen + hdgst + req->pdu_len + ddgst);
+	data->ttag = pdu->ttag;
+	data->command_id = rq->tag;
+	data->data_offset = cpu_to_le32(req->snd.data_sent);
+	data->data_length = cpu_to_le32(req->pdu_len);
+	return 0;
+}
+
+static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
+		struct nvme_tcp_r2t_pdu *pdu)
+{
+	struct nvme_tcp_request *req;
+	struct request *rq;
+	int ret;
+
+	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
+	if (!rq) {
+		dev_err(queue->ctrl->ctrl.device,
+			"queue %d tag %#x not found\n",
+			nvme_tcp_queue_id(queue), pdu->command_id);
+		return -ENOENT;
+	}
+	req = blk_mq_rq_to_pdu(rq);
+
+	ret = nvme_tcp_setup_h2c_data_pdu(req, pdu);
+	if (unlikely(ret))
+		return ret;
+
+	req->snd.state = NVME_TCP_SEND_H2C_PDU;
+	req->snd.offset = 0;
+
+	nvme_tcp_queue_request(req);
+
+	return 0;
+}
+
+static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
+		unsigned int *offset, size_t *len)
+{
+	struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+	struct nvme_tcp_hdr *hdr;
+	size_t rcv_len = min_t(size_t, *len, rcv->pdu_remaining);
+	int ret;
+
+	ret = skb_copy_bits(skb, *offset, &rcv->pdu[rcv->pdu_offset], rcv_len);
+	if (unlikely(ret))
+		return ret;
+
+	rcv->pdu_remaining -= rcv_len;
+	rcv->pdu_offset += rcv_len;
+	*offset += rcv_len;
+	*len -= rcv_len;
+	if (queue->rcv.pdu_remaining)
+		return 0;
+
+	hdr = (void *)rcv->pdu;
+	if (queue->hdr_digest) {
+		ret = nvme_tcp_verify_hdgst(queue, rcv->pdu, hdr->hlen);
+		if (unlikely(ret))
+			return ret;
+	}
+
+	if (queue->data_digest) {
+		ret = nvme_tcp_check_ddgst(queue, rcv->pdu);
+		if (unlikely(ret))
+			return ret;
+	}
+
+	switch (hdr->type) {
+	case nvme_tcp_c2h_data:
+		ret = nvme_tcp_handle_c2h_data(queue, (void *)rcv->pdu);
+		break;
+	case nvme_tcp_rsp:
+		nvme_tcp_init_recv_ctx(queue);
+		ret = nvme_tcp_handle_comp(queue, (void *)rcv->pdu);
+		break;
+	case nvme_tcp_r2t:
+		nvme_tcp_init_recv_ctx(queue);
+		ret = nvme_tcp_handle_r2t(queue, (void *)rcv->pdu);
+		break;
+	default:
+		dev_err(queue->ctrl->ctrl.device, "unsupported pdu type (%d)\n",
+			hdr->type);
+		return -EINVAL;
+	}
+
+	return ret;
+}
+
+static void nvme_tcp_init_recv_iter(struct nvme_tcp_request *req)
+{
+	struct bio *bio = req->rcv.curr_bio;
+	struct bio_vec *vec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+	unsigned int nsegs = bio_segments(bio);
+
+	iov_iter_bvec(&req->rcv.iter, READ, vec, nsegs,
+		bio->bi_iter.bi_size);
+	req->rcv.iter.iov_offset = bio->bi_iter.bi_bvec_done;
+}
+
+static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
+			      unsigned int *offset, size_t *len)
+{
+	struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+	struct nvme_tcp_data_pdu *pdu = (void *)rcv->pdu;
+	struct nvme_tcp_request *req;
+	struct request *rq;
+
+	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
+	if (!rq) {
+		dev_err(queue->ctrl->ctrl.device,
+			"queue %d tag %#x not found\n",
+			nvme_tcp_queue_id(queue), pdu->command_id);
+		return -ENOENT;
+	}
+	req = blk_mq_rq_to_pdu(rq);
+
+	while (true) {
+		int recv_len, ret;
+
+		recv_len = min_t(size_t, *len, rcv->data_remaining);
+		if (!recv_len)
+			break;
+
+		/*
+		 * FIXME: This assumes that data comes in-order,
+		 *  need to handle the out-of-order case.
+		 */
+		if (!iov_iter_count(&req->rcv.iter)) {
+			req->rcv.curr_bio = req->rcv.curr_bio->bi_next;
+
+			/*
+			 * If we don't have any bios it means that the controller
+			 * sent more data than we requested, hence error
+			 */
+			if (!req->rcv.curr_bio) {
+				dev_err(queue->ctrl->ctrl.device,
+					"queue %d no space in request %#x",
+					nvme_tcp_queue_id(queue), rq->tag);
+				nvme_tcp_init_recv_ctx(queue);
+				return -EIO;
+			}
+			nvme_tcp_init_recv_iter(req);
+		}
+
+		/* we can read only from what is left in this bio */
+		recv_len = min_t(size_t, recv_len,
+				iov_iter_count(&req->rcv.iter));
+
+		if (queue->data_digest)
+			ret = skb_copy_and_hash_datagram_iter(skb, *offset,
+				&req->rcv.iter, recv_len, queue->rcv_hash);
+		else
+			ret = skb_copy_datagram_iter(skb, *offset,
+					&req->rcv.iter, recv_len);
+		if (ret) {
+			dev_err(queue->ctrl->ctrl.device,
+				"queue %d failed to copy request %#x data",
+				nvme_tcp_queue_id(queue), rq->tag);
+			return ret;
+		}
+
+		*len -= recv_len;
+		*offset += recv_len;
+		rcv->data_remaining -= recv_len;
+	}
+
+	if (!rcv->data_remaining) {
+		if (queue->data_digest) {
+			nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
+			rcv->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
+		} else {
+			nvme_tcp_init_recv_ctx(queue);
+		}
+	}
+
+	return 0;
+}
+
+static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
+		struct sk_buff *skb, unsigned int *offset, size_t *len)
+{
+	struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+	char *ddgst = (char *)&queue->recv_ddgst;
+	size_t recv_len = min_t(size_t, *len, rcv->ddgst_remaining);
+	off_t off = NVME_TCP_DIGEST_LENGTH - rcv->ddgst_remaining;
+	int ret;
+
+	ret = skb_copy_bits(skb, *offset, &ddgst[off], recv_len);
+	if (unlikely(ret))
+		return ret;
+
+	rcv->ddgst_remaining -= recv_len;
+	*offset += recv_len;
+	*len -= recv_len;
+	if (rcv->ddgst_remaining)
+		return 0;
+
+	if (queue->recv_ddgst != queue->exp_ddgst) {
+		dev_err(queue->ctrl->ctrl.device,
+			"data digest error: recv %#x expected %#x\n",
+			le32_to_cpu(queue->recv_ddgst),
+			le32_to_cpu(queue->exp_ddgst));
+		return -EIO;
+	}
+
+	nvme_tcp_init_recv_ctx(queue);
+	return 0;
+}
+
+static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
+			     unsigned int offset, size_t len)
+{
+	struct nvme_tcp_queue *queue = desc->arg.data;
+	size_t consumed = len;
+	int result;
+
+	while (len) {
+		switch (nvme_tcp_recv_state(queue)) {
+		case NVME_TCP_RECV_PDU:
+			result = nvme_tcp_recv_pdu(queue, skb, &offset, &len);
+			break;
+		case NVME_TCP_RECV_DATA:
+			result = nvme_tcp_recv_data(queue, skb, &offset, &len);
+			break;
+		case NVME_TCP_RECV_DDGST:
+			result = nvme_tcp_recv_ddgst(queue, skb, &offset, &len);
+			break;
+		default:
+			result = -EFAULT;
+		}
+		if (result) {
+			dev_err(queue->ctrl->ctrl.device,
+				"receive failed:  %d\n", result);
+			queue->rd_enabled = false;
+			nvme_tcp_error_recovery(&queue->ctrl->ctrl);
+			return result;
+		}
+	}
+
+	return consumed;
+}
+
+static void nvme_tcp_data_ready(struct sock *sk)
+{
+	struct nvme_tcp_queue *queue;
+
+	read_lock(&sk->sk_callback_lock);
+	queue = sk->sk_user_data;
+	if (unlikely(!queue || !queue->rd_enabled))
+		goto done;
+
+	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+done:
+	read_unlock(&sk->sk_callback_lock);
+}
+
+static void nvme_tcp_write_space(struct sock *sk)
+{
+	struct nvme_tcp_queue *queue;
+
+	read_lock_bh(&sk->sk_callback_lock);
+	queue = sk->sk_user_data;
+
+	if (!queue)
+		goto done;
+
+	if (sk_stream_is_writeable(sk)) {
+		clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+	}
+done:
+	read_unlock_bh(&sk->sk_callback_lock);
+}
+
+static void nvme_tcp_state_change(struct sock *sk)
+{
+	struct nvme_tcp_queue *queue;
+
+	read_lock(&sk->sk_callback_lock);
+	queue = sk->sk_user_data;
+	if (!queue)
+		goto done;
+
+	switch (sk->sk_state) {
+	case TCP_CLOSE:
+	case TCP_CLOSE_WAIT:
+	case TCP_LAST_ACK:
+	case TCP_FIN_WAIT1:
+	case TCP_FIN_WAIT2:
+		/* fallthrough */
+		nvme_tcp_error_recovery(&queue->ctrl->ctrl);
+		break;
+	default:
+		dev_info(queue->ctrl->ctrl.device,
+			"queue %d socket state %d\n",
+			nvme_tcp_queue_id(queue), sk->sk_state);
+	}
+
+	queue->sc(sk);
+done:
+	read_unlock(&sk->sk_callback_lock);
+}
+
+static inline void nvme_tcp_done_send_req(struct nvme_tcp_queue *queue)
+{
+	queue->request = NULL;
+}
+
+static void nvme_tcp_fail_request(struct nvme_tcp_request *req)
+{
+	union nvme_result res = {};
+
+	nvme_end_request(blk_mq_rq_from_pdu(req),
+		NVME_SC_DATA_XFER_ERROR, res);
+}
+
+static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
+{
+	struct nvme_tcp_queue *queue = req->queue;
+
+	while (true) {
+		struct page *page = nvme_tcp_req_cur_page(req);
+		size_t offset = nvme_tcp_req_cur_offset(req);
+		size_t len = nvme_tcp_req_cur_length(req);
+		bool last = nvme_tcp_pdu_last_send(req, len);
+		int ret, flags = MSG_DONTWAIT;
+
+		if (last && !queue->data_digest)
+			flags |= MSG_EOR;
+		else
+			flags |= MSG_MORE;
+
+		ret = kernel_sendpage(queue->sock, page, offset, len, flags);
+		if (ret <= 0)
+			return ret;
+
+		nvme_tcp_advance_req(req, ret);
+		if (queue->data_digest)
+			nvme_tcp_ddgst_update(queue->snd_hash, page, offset, ret);
+
+		/* fully successful last write */
+		if (last && ret == len) {
+			if (queue->data_digest) {
+				nvme_tcp_ddgst_final(queue->snd_hash,
+					&req->ddgst);
+				req->snd.state = NVME_TCP_SEND_DDGST;
+				req->snd.offset = 0;
+			} else {
+				nvme_tcp_done_send_req(queue);
+			}
+			return 1;
+		}
+	}
+	return -EAGAIN;
+}
+
+static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
+{
+	struct nvme_tcp_queue *queue = req->queue;
+	struct nvme_tcp_send_ctx *snd = &req->snd;
+	struct nvme_tcp_cmd_pdu *pdu = req->pdu;
+	bool inline_data = nvme_tcp_has_inline_data(req);
+	int flags = MSG_DONTWAIT | (inline_data ? MSG_MORE : MSG_EOR);
+	u8 hdgst = nvme_tcp_hdgst_len(queue);
+	int len = sizeof(*pdu) + hdgst - snd->offset;
+	int ret;
+
+	if (queue->hdr_digest && !snd->offset)
+		nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
+
+	ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
+			offset_in_page(pdu) + snd->offset, len,  flags);
+	if (unlikely(ret <= 0))
+		return ret;
+
+	len -= ret;
+	if (!len) {
+		if (inline_data) {
+			req->snd.state = NVME_TCP_SEND_DATA;
+			if (queue->data_digest)
+				crypto_ahash_init(queue->snd_hash);
+			nvme_tcp_init_send_iter(req);
+		} else {
+			nvme_tcp_done_send_req(queue);
+		}
+		return 1;
+	}
+	snd->offset += ret;
+
+	return -EAGAIN;
+}
+
+static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req)
+{
+	struct nvme_tcp_queue *queue = req->queue;
+	struct nvme_tcp_send_ctx *snd = &req->snd;
+	struct nvme_tcp_data_pdu *pdu = req->pdu;
+	u8 hdgst = nvme_tcp_hdgst_len(queue);
+	int len = sizeof(*pdu) - snd->offset + hdgst;
+	int ret;
+
+	if (queue->hdr_digest && !snd->offset)
+		nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
+
+	ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
+			offset_in_page(pdu) + snd->offset, len,
+			MSG_DONTWAIT | MSG_MORE);
+	if (unlikely(ret <= 0))
+		return ret;
+
+	len -= ret;
+	if (!len) {
+		req->snd.state = NVME_TCP_SEND_DATA;
+		if (queue->data_digest)
+			crypto_ahash_init(queue->snd_hash);
+		if (!req->snd.data_sent)
+			nvme_tcp_init_send_iter(req);
+		return 1;
+	}
+	snd->offset += ret;
+
+	return -EAGAIN;
+}
+
+static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req)
+{
+	struct nvme_tcp_queue *queue = req->queue;
+	int ret;
+	struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_EOR };
+	struct kvec iov = {
+		.iov_base = &req->ddgst + req->snd.offset,
+		.iov_len = NVME_TCP_DIGEST_LENGTH - req->snd.offset
+	};
+
+	ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
+	if (unlikely(ret <= 0))
+		return ret;
+
+	if (req->snd.offset + ret == NVME_TCP_DIGEST_LENGTH) {
+		nvme_tcp_done_send_req(queue);
+		return 1;
+	}
+
+	req->snd.offset += ret;
+	return -EAGAIN;
+}
+
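+/*
+ * Send state machine: CMD_PDU -> [H2C data PDU] -> DATA -> [DDGST].
+ * Returns 1 if forward progress was made, 0 if there is nothing to send
+ * or the socket buffer is full, or a negative error code.
+ */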
+static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
+{
+	struct nvme_tcp_request *req;
+	int ret = 1;
+
+	if (!queue->request) {
+		queue->request = nvme_tcp_fetch_request(queue);
+		if (!queue->request)
+			return 0;
+	}
+	req = queue->request;
+
+	if (req->snd.state == NVME_TCP_SEND_CMD_PDU) {
+		ret = nvme_tcp_try_send_cmd_pdu(req);
+		if (ret <= 0)
+			goto done;
+		if (!nvme_tcp_has_inline_data(req))
+			return ret;
+	}
+
+	if (req->snd.state == NVME_TCP_SEND_H2C_PDU) {
+		ret = nvme_tcp_try_send_data_pdu(req);
+		if (ret <= 0)
+			goto done;
+	}
+
+	if (req->snd.state == NVME_TCP_SEND_DATA) {
+		ret = nvme_tcp_try_send_data(req);
+		if (ret <= 0)
+			goto done;
+	}
+
+	if (req->snd.state == NVME_TCP_SEND_DDGST)
+		ret = nvme_tcp_try_send_ddgst(req);
+done:
+	if (ret == -EAGAIN)
+		ret = 0;
+	return ret;
+}
+
+static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
+{
+	struct sock *sk = queue->sock->sk;
+	read_descriptor_t rd_desc;
+	int consumed;
+
+	rd_desc.arg.data = queue;
+	rd_desc.count = 1;
+	lock_sock(sk);
+	consumed = tcp_read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
+	release_sock(sk);
+	return consumed;
+}
+
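+/*
+ * Per-queue I/O context: alternate between sending and receiving for up
+ * to ~1ms, then requeue ourselves if there is still work pending.
+ */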
+static void nvme_tcp_io_work(struct work_struct *w)
+{
+	struct nvme_tcp_queue *queue =
+		container_of(w, struct nvme_tcp_queue, io_work);
+	unsigned long deadline = jiffies + msecs_to_jiffies(1);
+
+	do {
+		bool pending = false;
+		int result;
+
+		result = nvme_tcp_try_send(queue);
+		if (result > 0) {
+			pending = true;
+		} else if (unlikely(result < 0)) {
+			dev_err(queue->ctrl->ctrl.device,
+				"failed to send request %d\n", result);
+			if (result != -EPIPE)
+				nvme_tcp_fail_request(queue->request);
+			nvme_tcp_done_send_req(queue);
+			return;
+		}
+
+		result = nvme_tcp_try_recv(queue);
+		if (result > 0)
+			pending = true;
+
+		if (!pending)
+			return;
+
+	} while (!time_after(jiffies, deadline)); /* quota is exhausted */
+
+	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+}
+
+static void nvme_tcp_free_crypto(struct nvme_tcp_queue *queue)
+{
+	struct crypto_ahash *tfm = crypto_ahash_reqtfm(queue->rcv_hash);
+
+	ahash_request_free(queue->rcv_hash);
+	ahash_request_free(queue->snd_hash);
+	crypto_free_ahash(tfm);
+}
+
+static int nvme_tcp_alloc_crypto(struct nvme_tcp_queue *queue)
+{
+	struct crypto_ahash *tfm;
+
+	tfm = crypto_alloc_ahash("crc32c", 0, CRYPTO_ALG_ASYNC);
+	if (IS_ERR(tfm))
+		return PTR_ERR(tfm);
+
+	queue->snd_hash = ahash_request_alloc(tfm, GFP_KERNEL);
+	if (!queue->snd_hash)
+		goto free_tfm;
+	ahash_request_set_callback(queue->snd_hash, 0, NULL, NULL);
+
+	queue->rcv_hash = ahash_request_alloc(tfm, GFP_KERNEL);
+	if (!queue->rcv_hash)
+		goto free_snd_hash;
+	ahash_request_set_callback(queue->rcv_hash, 0, NULL, NULL);
+
+	return 0;
+free_snd_hash:
+	ahash_request_free(queue->snd_hash);
+free_tfm:
+	crypto_free_ahash(tfm);
+	return -ENOMEM;
+}
+
+static void nvme_tcp_free_async_req(struct nvme_tcp_ctrl *ctrl)
+{
+	struct nvme_tcp_request *async = &ctrl->async_req;
+
+	page_frag_free(async->pdu);
+}
+
+static int nvme_tcp_alloc_async_req(struct nvme_tcp_ctrl *ctrl)
+{
+	struct nvme_tcp_queue *queue = &ctrl->queues[0];
+	struct nvme_tcp_request *async = &ctrl->async_req;
+	u8 hdgst = nvme_tcp_hdgst_len(queue);
+
+	async->pdu = page_frag_alloc(&queue->pf_cache,
+		sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
+		GFP_KERNEL | __GFP_ZERO);
+	if (!async->pdu)
+		return -ENOMEM;
+
+	async->queue = &ctrl->queues[0];
+	return 0;
+}
+
+static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
+{
+	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+
+	if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
+		return;
+
+	if (queue->hdr_digest || queue->data_digest)
+		nvme_tcp_free_crypto(queue);
+
+	sock_release(queue->sock);
+	kfree(queue->rcv.pdu);
+}
+
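+/*
+ * Exchange the connection initialization PDUs: send an ICReq and validate
+ * the controller ICResp (PDU format version, digest negotiation and CPDA).
+ */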
+static int nvme_tcp_init_connection(struct nvme_tcp_queue *queue)
+{
+	struct nvme_tcp_icreq_pdu *icreq;
+	struct nvme_tcp_icresp_pdu *icresp;
+	struct msghdr msg = {};
+	struct kvec iov;
+	bool ctrl_hdgst, ctrl_ddgst;
+	int ret;
+
+	icreq = kzalloc(sizeof(*icreq), GFP_KERNEL);
+	if (!icreq)
+		return -ENOMEM;
+
+	icresp = kzalloc(sizeof(*icresp), GFP_KERNEL);
+	if (!icresp) {
+		ret = -ENOMEM;
+		goto free_icreq;
+	}
+
+	icreq->hdr.type = nvme_tcp_icreq;
+	icreq->hdr.hlen = sizeof(*icreq);
+	icreq->hdr.pdo = 0;
+	icreq->hdr.plen = cpu_to_le32(icreq->hdr.hlen);
+	icreq->pfv = cpu_to_le16(NVME_TCP_PFV_1_0);
+	icreq->maxr2t = cpu_to_le16(1); /* single inflight r2t supported */
+	icreq->hpda = 0; /* no alignment constraint */
+	if (queue->hdr_digest)
+		icreq->digest |= NVME_TCP_HDR_DIGEST_ENABLE;
+	if (queue->data_digest)
+		icreq->digest |= NVME_TCP_DATA_DIGEST_ENABLE;
+
+	iov.iov_base = icreq;
+	iov.iov_len = sizeof(*icreq);
+	ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
+	if (ret < 0)
+		goto free_icresp;
+
+	memset(&msg, 0, sizeof(msg));
+	iov.iov_base = icresp;
+	iov.iov_len = sizeof(*icresp);
+	ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
+			iov.iov_len, msg.msg_flags);
+	if (ret < 0)
+		goto free_icresp;
+
+	ret = -EINVAL;
+	if (icresp->hdr.type != nvme_tcp_icresp) {
+		pr_err("queue %d: bad type returned %d\n",
+			nvme_tcp_queue_id(queue), icresp->hdr.type);
+		goto free_icresp;
+	}
+
+	if (le32_to_cpu(icresp->hdr.plen) != sizeof(*icresp)) {
+		pr_err("queue %d: bad pdu length returned %d\n",
+			nvme_tcp_queue_id(queue), icresp->hdr.plen);
+		goto free_icresp;
+	}
+
+	if (icresp->pfv != NVME_TCP_PFV_1_0) {
+		pr_err("queue %d: bad pfv returned %d\n",
+			nvme_tcp_queue_id(queue), icresp->pfv);
+		goto free_icresp;
+	}
+
+	ctrl_ddgst = !!(icresp->digest & NVME_TCP_DATA_DIGEST_ENABLE);
+	if ((queue->data_digest && !ctrl_ddgst) ||
+	    (!queue->data_digest && ctrl_ddgst)) {
+		pr_err("queue %d: data digest mismatch host: %s ctrl: %s\n",
+			nvme_tcp_queue_id(queue),
+			queue->data_digest ? "enabled" : "disabled",
+			ctrl_ddgst ? "enabled" : "disabled");
+		goto free_icresp;
+	}
+
+	ctrl_hdgst = !!(icresp->digest & NVME_TCP_HDR_DIGEST_ENABLE);
+	if ((queue->hdr_digest && !ctrl_hdgst) ||
+	    (!queue->hdr_digest && ctrl_hdgst)) {
+		pr_err("queue %d: header digest mismatch host: %s ctrl: %s\n",
+			nvme_tcp_queue_id(queue),
+			queue->hdr_digest ? "enabled" : "disabled",
+			ctrl_hdgst ? "enabled" : "disabled");
+		goto free_icresp;
+	}
+
+	if (icresp->cpda != 0) {
+		pr_err("queue %d: unsupported cpda returned %d\n",
+			nvme_tcp_queue_id(queue), icresp->cpda);
+		goto free_icresp;
+	}
+
+	ret = 0;
+free_icresp:
+	kfree(icresp);
+free_icreq:
+	kfree(icreq);
+	return ret;
+}
+
+static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
+		int qid, size_t queue_size)
+{
+	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+	struct linger sol = { .l_onoff = 1, .l_linger = 0 };
+	int ret, opt, rcv_pdu_size;
+
+	queue->ctrl = ctrl;
+	INIT_LIST_HEAD(&queue->send_list);
+	spin_lock_init(&queue->lock);
+	INIT_WORK(&queue->io_work, nvme_tcp_io_work);
+	queue->queue_size = queue_size;
+
+	if (qid > 0)
+		queue->cmnd_capsule_len = ctrl->ctrl.ioccsz * 16;
+	else
+		queue->cmnd_capsule_len = sizeof(struct nvme_command) +
+						NVME_TCP_ADMIN_CCSZ;
+
+	ret = sock_create(ctrl->addr.ss_family, SOCK_STREAM,
+			IPPROTO_TCP, &queue->sock);
+	if (ret) {
+		dev_err(ctrl->ctrl.device,
+			"failed to create socket: %d\n", ret);
+		return ret;
+	}
+
+	/* Single syn retry */
+	opt = 1;
+	ret = kernel_setsockopt(queue->sock, IPPROTO_TCP, TCP_SYNCNT,
+			(char *)&opt, sizeof(opt));
+	if (ret) {
+		dev_err(ctrl->ctrl.device,
+			"failed to set TCP_SYNCNT sock opt %d\n", ret);
+		goto err_sock;
+	}
+
+	/* Set TCP no delay */
+	opt = 1;
+	ret = kernel_setsockopt(queue->sock, IPPROTO_TCP,
+			TCP_NODELAY, (char *)&opt, sizeof(opt));
+	if (ret) {
+		dev_err(ctrl->ctrl.device,
+			"failed to set TCP_NODELAY sock opt %d\n", ret);
+		goto err_sock;
+	}
+
+	/*
+	 * Cleanup whatever is sitting in the TCP transmit queue on socket
+	 * close. This is done to prevent stale data from being sent should
+	 * the network connection be restored before TCP times out.
+	 */
+	ret = kernel_setsockopt(queue->sock, SOL_SOCKET, SO_LINGER,
+			(char *)&sol, sizeof(sol));
+	if (ret) {
+		dev_err(ctrl->ctrl.device,
+			"failed to set SO_LINGER sock opt %d\n", ret);
+		goto err_sock;
+	}
+
+	queue->sock->sk->sk_allocation = GFP_ATOMIC;
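+	/* admin queue (qid 0) I/O context runs on cpu 0, I/O queue N on cpu N - 1 */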
+	queue->io_cpu = (qid == 0) ? 0 : qid - 1;
+	queue->request = NULL;
+	queue->rcv.data_remaining = 0;
+	queue->rcv.ddgst_remaining = 0;
+	queue->rcv.pdu_remaining = 0;
+	queue->rcv.pdu_offset = 0;
+	sk_set_memalloc(queue->sock->sk);
+
+	if (ctrl->ctrl.opts->mask & NVMF_OPT_HOST_TRADDR) {
+		ret = kernel_bind(queue->sock, (struct sockaddr *)&ctrl->src_addr,
+			sizeof(ctrl->src_addr));
+		if (ret) {
+			dev_err(ctrl->ctrl.device,
+				"failed to bind queue %d socket %d\n",
+				qid, ret);
+			goto err_sock;
+		}
+	}
+
+	queue->hdr_digest = nctrl->opts->hdr_digest;
+	queue->data_digest = nctrl->opts->data_digest;
+	if (queue->hdr_digest || queue->data_digest) {
+		ret = nvme_tcp_alloc_crypto(queue);
+		if (ret) {
+			dev_err(ctrl->ctrl.device,
+				"failed to allocate queue %d crypto\n", qid);
+			goto err_sock;
+		}
+	}
+
+	rcv_pdu_size = sizeof(struct nvme_tcp_rsp_pdu) +
+			nvme_tcp_hdgst_len(queue);
+	queue->rcv.pdu = kmalloc(rcv_pdu_size, GFP_KERNEL);
+	if (!queue->rcv.pdu) {
+		ret = -ENOMEM;
+		goto err_crypto;
+	}
+
+	dev_dbg(ctrl->ctrl.device, "connecting queue %d\n",
+			nvme_tcp_queue_id(queue));
+
+	ret = kernel_connect(queue->sock, (struct sockaddr *)&ctrl->addr,
+		sizeof(ctrl->addr), 0);
+	if (ret) {
+		dev_err(ctrl->ctrl.device,
+			"failed to connect socket: %d\n", ret);
+		goto err_rcv_pdu;
+	}
+
+	ret = nvme_tcp_init_connection(queue);
+	if (ret)
+		goto err_init_connect;
+
+	queue->rd_enabled = true;
+	set_bit(NVME_TCP_Q_ALLOCATED, &queue->flags);
+	nvme_tcp_init_recv_ctx(queue);
+
+	write_lock_bh(&queue->sock->sk->sk_callback_lock);
+	queue->sock->sk->sk_user_data = queue;
+	queue->sc = queue->sock->sk->sk_state_change;
+	queue->dr = queue->sock->sk->sk_data_ready;
+	queue->ws = queue->sock->sk->sk_write_space;
+	queue->sock->sk->sk_data_ready = nvme_tcp_data_ready;
+	queue->sock->sk->sk_state_change = nvme_tcp_state_change;
+	queue->sock->sk->sk_write_space = nvme_tcp_write_space;
+	write_unlock_bh(&queue->sock->sk->sk_callback_lock);
+
+	return 0;
+
+err_init_connect:
+	kernel_sock_shutdown(queue->sock, SHUT_RDWR);
+err_rcv_pdu:
+	kfree(queue->rcv.pdu);
+err_crypto:
+	if (queue->hdr_digest || queue->data_digest)
+		nvme_tcp_free_crypto(queue);
+err_sock:
+	sock_release(queue->sock);
+	queue->sock = NULL;
+	return ret;
+}
+
+static void nvme_tcp_restore_sock_calls(struct nvme_tcp_queue *queue)
+{
+	struct socket *sock = queue->sock;
+
+	write_lock_bh(&sock->sk->sk_callback_lock);
+	sock->sk->sk_user_data  = NULL;
+	sock->sk->sk_data_ready = queue->dr;
+	sock->sk->sk_state_change = queue->sc;
+	sock->sk->sk_write_space  = queue->ws;
+	write_unlock_bh(&sock->sk->sk_callback_lock);
+}
+
+static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
+{
+	kernel_sock_shutdown(queue->sock, SHUT_RDWR);
+	nvme_tcp_restore_sock_calls(queue);
+	cancel_work_sync(&queue->io_work);
+}
+
+static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
+{
+	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+
+	if (!test_and_clear_bit(NVME_TCP_Q_LIVE, &queue->flags))
+		return;
+
+	__nvme_tcp_stop_queue(queue);
+}
+
+static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
+{
+	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+	int ret;
+
+	if (idx)
+		ret = nvmf_connect_io_queue(nctrl, idx);
+	else
+		ret = nvmf_connect_admin_queue(nctrl);
+
+	if (!ret) {
+		set_bit(NVME_TCP_Q_LIVE, &ctrl->queues[idx].flags);
+	} else {
+		__nvme_tcp_stop_queue(&ctrl->queues[idx]);
+		dev_err(nctrl->device,
+			"failed to connect queue: %d ret=%d\n", idx, ret);
+	}
+	return ret;
+}
+
+static void nvme_tcp_free_tagset(struct nvme_ctrl *nctrl,
+		struct blk_mq_tag_set *set)
+{
+	blk_mq_free_tag_set(set);
+}
+
+static struct blk_mq_tag_set *nvme_tcp_alloc_tagset(struct nvme_ctrl *nctrl,
+		bool admin)
+{
+	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+	struct blk_mq_tag_set *set;
+	int ret;
+
+	if (admin) {
+		set = &ctrl->admin_tag_set;
+		memset(set, 0, sizeof(*set));
+		set->ops = &nvme_tcp_admin_mq_ops;
+		set->queue_depth = NVME_AQ_MQ_TAG_DEPTH;
+		set->reserved_tags = 2; /* connect + keep-alive */
+		set->numa_node = NUMA_NO_NODE;
+		set->cmd_size = sizeof(struct nvme_tcp_request);
+		set->driver_data = ctrl;
+		set->nr_hw_queues = 1;
+		set->timeout = ADMIN_TIMEOUT;
+	} else {
+		set = &ctrl->tag_set;
+		memset(set, 0, sizeof(*set));
+		set->ops = &nvme_tcp_mq_ops;
+		set->queue_depth = nctrl->sqsize + 1;
+		set->reserved_tags = 1; /* fabric connect */
+		set->numa_node = NUMA_NO_NODE;
+		set->flags = BLK_MQ_F_SHOULD_MERGE;
+		set->cmd_size = sizeof(struct nvme_tcp_request);
+		set->driver_data = ctrl;
+		set->nr_hw_queues = nctrl->queue_count - 1;
+		set->timeout = NVME_IO_TIMEOUT;
+	}
+
+	ret = blk_mq_alloc_tag_set(set);
+	if (ret)
+		return ERR_PTR(ret);
+
+	return set;
+}
+
+static void nvme_tcp_free_admin_queue(struct nvme_ctrl *ctrl)
+{
+	if (to_tcp_ctrl(ctrl)->async_req.pdu) {
+		nvme_tcp_free_async_req(to_tcp_ctrl(ctrl));
+		to_tcp_ctrl(ctrl)->async_req.pdu = NULL;
+	}
+
+	nvme_tcp_free_queue(ctrl, 0);
+}
+
+static void nvme_tcp_free_io_queues(struct nvme_ctrl *ctrl)
+{
+	int i;
+
+	for (i = 1; i < ctrl->queue_count; i++)
+		nvme_tcp_free_queue(ctrl, i);
+}
+
+static void nvme_tcp_stop_admin_queue(struct nvme_ctrl *ctrl)
+{
+	nvme_tcp_stop_queue(ctrl, 0);
+}
+
+static void nvme_tcp_stop_io_queues(struct nvme_ctrl *ctrl)
+{
+	int i;
+
+	for (i = 1; i < ctrl->queue_count; i++)
+		nvme_tcp_stop_queue(ctrl, i);
+}
+
+static int nvme_tcp_start_admin_queue(struct nvme_ctrl *ctrl)
+{
+	return nvme_tcp_start_queue(ctrl, 0);
+}
+
+static int nvme_tcp_start_io_queues(struct nvme_ctrl *ctrl)
+{
+	int i, ret = 0;
+
+	for (i = 1; i < ctrl->queue_count; i++) {
+		ret = nvme_tcp_start_queue(ctrl, i);
+		if (ret)
+			goto out_stop_queues;
+	}
+
+	return 0;
+
+out_stop_queues:
+	for (i--; i >= 1; i--)
+		nvme_tcp_stop_queue(ctrl, i);
+	return ret;
+}
+
+static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
+{
+	int ret;
+
+	ret = nvme_tcp_alloc_queue(ctrl, 0, NVME_AQ_DEPTH);
+	if (ret)
+		return ret;
+
+	ret = nvme_tcp_alloc_async_req(to_tcp_ctrl(ctrl));
+	if (ret)
+		goto out_free_queue;
+
+	return 0;
+
+out_free_queue:
+	nvme_tcp_free_queue(ctrl, 0);
+	return ret;
+}
+
+static int nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
+{
+	int i, ret;
+
+	for (i = 1; i < ctrl->queue_count; i++) {
+		ret = nvme_tcp_alloc_queue(ctrl, i,
+				ctrl->sqsize + 1);
+		if (ret)
+			goto out_free_queues;
+	}
+
+	return 0;
+
+out_free_queues:
+	for (i--; i >= 1; i--)
+		nvme_tcp_free_queue(ctrl, i);
+
+	return ret;
+}
+
+static unsigned int nvme_tcp_nr_io_queues(struct nvme_ctrl *ctrl)
+{
+	return min(ctrl->queue_count - 1, num_online_cpus());
+}
+
+static int nvme_alloc_io_queues(struct nvme_ctrl *ctrl)
+{
+	unsigned int nr_io_queues;
+	int ret;
+
+	nr_io_queues = nvme_tcp_nr_io_queues(ctrl);
+	ret = nvme_set_queue_count(ctrl, &nr_io_queues);
+	if (ret)
+		return ret;
+
+	ctrl->queue_count = nr_io_queues + 1;
+	if (ctrl->queue_count < 2)
+		return 0;
+
+	dev_info(ctrl->device,
+		"creating %d I/O queues.\n", nr_io_queues);
+
+	return nvme_tcp_alloc_io_queues(ctrl);
+}
+
+void nvme_tcp_destroy_io_queues(struct nvme_ctrl *ctrl, bool remove)
+{
+	nvme_tcp_stop_io_queues(ctrl);
+	if (remove) {
+		if (ctrl->ops->flags & NVME_F_FABRICS)
+			blk_cleanup_queue(ctrl->connect_q);
+		nvme_tcp_free_tagset(ctrl, ctrl->tagset);
+	}
+	nvme_tcp_free_io_queues(ctrl);
+}
+
+int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
+{
+	int ret;
+
+	ret = nvme_alloc_io_queues(ctrl);
+	if (ret)
+		return ret;
+
+	if (new) {
+		ctrl->tagset = nvme_tcp_alloc_tagset(ctrl, false);
+		if (IS_ERR(ctrl->tagset)) {
+			ret = PTR_ERR(ctrl->tagset);
+			goto out_free_io_queues;
+		}
+
+		if (ctrl->ops->flags & NVME_F_FABRICS) {
+			ctrl->connect_q = blk_mq_init_queue(ctrl->tagset);
+			if (IS_ERR(ctrl->connect_q)) {
+				ret = PTR_ERR(ctrl->connect_q);
+				goto out_free_tag_set;
+			}
+		}
+	} else {
+		blk_mq_update_nr_hw_queues(ctrl->tagset,
+			ctrl->queue_count - 1);
+	}
+
+	ret = nvme_tcp_start_io_queues(ctrl);
+	if (ret)
+		goto out_cleanup_connect_q;
+
+	return 0;
+
+out_cleanup_connect_q:
+	if (new && (ctrl->ops->flags & NVME_F_FABRICS))
+		blk_cleanup_queue(ctrl->connect_q);
+out_free_tag_set:
+	if (new)
+		nvme_tcp_free_tagset(ctrl, ctrl->tagset);
+out_free_io_queues:
+	nvme_tcp_free_io_queues(ctrl);
+	return ret;
+}
+
+void nvme_tcp_destroy_admin_queue(struct nvme_ctrl *ctrl, bool remove)
+{
+	nvme_tcp_stop_admin_queue(ctrl);
+	if (remove) {
+		free_opal_dev(ctrl->opal_dev);
+		blk_cleanup_queue(ctrl->admin_q);
+		nvme_tcp_free_tagset(ctrl, ctrl->admin_tagset);
+	}
+	nvme_tcp_free_admin_queue(ctrl);
+}
+
+int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new)
+{
+	int error;
+
+	error = nvme_tcp_alloc_admin_queue(ctrl);
+	if (error)
+		return error;
+
+	if (new) {
+		ctrl->admin_tagset = nvme_tcp_alloc_tagset(ctrl, true);
+		if (IS_ERR(ctrl->admin_tagset)) {
+			error = PTR_ERR(ctrl->admin_tagset);
+			goto out_free_queue;
+		}
+
+		ctrl->admin_q = blk_mq_init_queue(ctrl->admin_tagset);
+		if (IS_ERR(ctrl->admin_q)) {
+			error = PTR_ERR(ctrl->admin_q);
+			goto out_free_tagset;
+		}
+	}
+
+	error = nvme_tcp_start_admin_queue(ctrl);
+	if (error)
+		goto out_cleanup_queue;
+
+	error = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &ctrl->cap);
+	if (error) {
+		dev_err(ctrl->device,
+			"prop_get NVME_REG_CAP failed\n");
+		goto out_stop_queue;
+	}
+
+	ctrl->sqsize = min_t(int, NVME_CAP_MQES(ctrl->cap), ctrl->sqsize);
+
+	error = nvme_enable_ctrl(ctrl, ctrl->cap);
+	if (error)
+		goto out_stop_queue;
+
+	error = nvme_init_identify(ctrl);
+	if (error)
+		goto out_stop_queue;
+
+	return 0;
+
+out_stop_queue:
+	nvme_tcp_stop_admin_queue(ctrl);
+out_cleanup_queue:
+	if (new)
+		blk_cleanup_queue(ctrl->admin_q);
+out_free_tagset:
+	if (new)
+		nvme_tcp_free_tagset(ctrl, ctrl->admin_tagset);
+out_free_queue:
+	nvme_tcp_free_admin_queue(ctrl);
+	return error;
+}
+
+static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
+		bool remove)
+{
+	blk_mq_quiesce_queue(ctrl->admin_q);
+	nvme_tcp_stop_admin_queue(ctrl);
+	blk_mq_tagset_busy_iter(ctrl->admin_tagset, nvme_cancel_request, ctrl);
+	blk_mq_unquiesce_queue(ctrl->admin_q);
+	nvme_tcp_destroy_admin_queue(ctrl, remove);
+}
+
+static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
+		bool remove)
+{
+	if (ctrl->queue_count > 1) {
+		nvme_stop_queues(ctrl);
+		nvme_tcp_stop_io_queues(ctrl);
+		blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl);
+		if (remove)
+			nvme_start_queues(ctrl);
+		nvme_tcp_destroy_io_queues(ctrl, remove);
+	}
+}
+
+static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl)
+{
+	/* If we are resetting/deleting then do nothing */
+	if (ctrl->state != NVME_CTRL_CONNECTING) {
+		WARN_ON_ONCE(ctrl->state == NVME_CTRL_NEW ||
+			ctrl->state == NVME_CTRL_LIVE);
+		return;
+	}
+
+	if (nvmf_should_reconnect(ctrl)) {
+		dev_info(ctrl->device, "Reconnecting in %d seconds...\n",
+			ctrl->opts->reconnect_delay);
+		queue_delayed_work(nvme_wq, &ctrl->connect_work,
+				ctrl->opts->reconnect_delay * HZ);
+	} else {
+		dev_info(ctrl->device, "Removing controller...\n");
+		nvme_delete_ctrl(ctrl);
+	}
+}
+
+static int nvme_tcp_setup_ctrl(struct nvme_ctrl *ctrl, bool new)
+{
+	struct nvmf_ctrl_options *opts = ctrl->opts;
+	int ret = -EINVAL;
+
+	ret = nvme_tcp_configure_admin_queue(ctrl, new);
+	if (ret)
+		return ret;
+
+	if (ctrl->icdoff) {
+		dev_err(ctrl->device, "icdoff is not supported!\n");
+		goto destroy_admin;
+	}
+
+	if (opts->queue_size > ctrl->sqsize + 1)
+		dev_warn(ctrl->device,
+			"queue_size %zu > ctrl sqsize %u, clamping down\n",
+			opts->queue_size, ctrl->sqsize + 1);
+
+	if (ctrl->sqsize + 1 > ctrl->maxcmd) {
+		dev_warn(ctrl->device,
+			"sqsize %u > ctrl maxcmd %u, clamping down\n",
+			ctrl->sqsize + 1, ctrl->maxcmd);
+		ctrl->sqsize = ctrl->maxcmd - 1;
+	}
+
+	if (ctrl->queue_count > 1) {
+		ret = nvme_tcp_configure_io_queues(ctrl, new);
+		if (ret)
+			goto destroy_admin;
+	}
+
+	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE)) {
+		/* state change failure is ok if we're in DELETING state */
+		WARN_ON_ONCE(ctrl->state != NVME_CTRL_DELETING);
+		ret = -EINVAL;
+		goto destroy_io;
+	}
+
+	nvme_start_ctrl(ctrl);
+	return 0;
+
+destroy_io:
+	if (ctrl->queue_count > 1)
+		nvme_tcp_destroy_io_queues(ctrl, new);
+destroy_admin:
+	nvme_tcp_stop_admin_queue(ctrl);
+	nvme_tcp_destroy_admin_queue(ctrl, new);
+	return ret;
+}
+
+static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
+{
+	struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
+			struct nvme_ctrl, connect_work);
+
+	++ctrl->nr_reconnects;
+
+	if (nvme_tcp_setup_ctrl(ctrl, false))
+		goto requeue;
+
+	dev_info(ctrl->device, "Successfully reconnected (%d attempt)\n",
+			ctrl->nr_reconnects);
+
+	ctrl->nr_reconnects = 0;
+
+	return;
+
+requeue:
+	dev_info(ctrl->device, "Failed reconnect attempt %d\n",
+			ctrl->nr_reconnects);
+	nvme_tcp_reconnect_or_remove(ctrl);
+}
+
+static void nvme_tcp_error_recovery_work(struct work_struct *work)
+{
+	struct nvme_ctrl *ctrl = container_of(work,
+			struct nvme_ctrl, err_work);
+
+	nvme_stop_keep_alive(ctrl);
+	nvme_tcp_teardown_io_queues(ctrl, false);
+	/* unquiesce to fail fast pending requests */
+	nvme_start_queues(ctrl);
+	nvme_tcp_teardown_admin_queue(ctrl, false);
+
+	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_CONNECTING)) {
+		/* state change failure is ok if we're in DELETING state */
+		WARN_ON_ONCE(ctrl->state != NVME_CTRL_DELETING);
+		return;
+	}
+
+	nvme_tcp_reconnect_or_remove(ctrl);
+}
+
+static void nvme_tcp_teardown_ctrl(struct nvme_ctrl *ctrl, bool shutdown)
+{
+	nvme_tcp_teardown_io_queues(ctrl, shutdown);
+	if (shutdown)
+		nvme_shutdown_ctrl(ctrl);
+	else
+		nvme_disable_ctrl(ctrl, ctrl->cap);
+	nvme_tcp_teardown_admin_queue(ctrl, shutdown);
+}
+
+static void nvme_tcp_delete_ctrl(struct nvme_ctrl *ctrl)
+{
+	nvme_tcp_teardown_ctrl(ctrl, true);
+}
+
+static void nvme_reset_ctrl_work(struct work_struct *work)
+{
+	struct nvme_ctrl *ctrl =
+		container_of(work, struct nvme_ctrl, reset_work);
+
+	nvme_stop_ctrl(ctrl);
+	nvme_tcp_teardown_ctrl(ctrl, false);
+
+	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_CONNECTING)) {
+		/* state change failure is ok if we're in DELETING state */
+		WARN_ON_ONCE(ctrl->state != NVME_CTRL_DELETING);
+		return;
+	}
+
+	if (nvme_tcp_setup_ctrl(ctrl, false))
+		goto out_fail;
+
+	return;
+
+out_fail:
+	++ctrl->nr_reconnects;
+	nvme_tcp_reconnect_or_remove(ctrl);
+}
+
+static void nvme_tcp_stop_ctrl(struct nvme_ctrl *ctrl)
+{
+	cancel_work_sync(&ctrl->err_work);
+	cancel_delayed_work_sync(&ctrl->connect_work);
+}
+
+static void nvme_tcp_free_ctrl(struct nvme_ctrl *nctrl)
+{
+	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+
+	if (list_empty(&ctrl->list))
+		goto free_ctrl;
+
+	mutex_lock(&nvme_tcp_ctrl_mutex);
+	list_del(&ctrl->list);
+	mutex_unlock(&nvme_tcp_ctrl_mutex);
+
+	nvmf_free_options(nctrl->opts);
+free_ctrl:
+	kfree(ctrl->queues);
+	kfree(ctrl);
+}
+
+static void nvme_tcp_set_sg_null(struct nvme_command *c)
+{
+	struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
+
+	sg->addr = 0;
+	sg->length = 0;
+	sg->type = (NVME_TRANSPORT_SGL_DATA_DESC << 4) |
+			NVME_SGL_FMT_TRANSPORT_A;
+}
+
+static void nvme_tcp_set_sg_inline(struct nvme_tcp_queue *queue,
+		struct nvme_tcp_request *req, struct nvme_command *c)
+{
+	struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
+
+	sg->addr = cpu_to_le64(queue->ctrl->ctrl.icdoff);
+	sg->length = cpu_to_le32(req->data_len);
+	sg->type = (NVME_SGL_FMT_DATA_DESC << 4) | NVME_SGL_FMT_OFFSET;
+}
+
+static void nvme_tcp_set_sg_host_data(struct nvme_tcp_request *req,
+		struct nvme_command *c)
+{
+	struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
+
+	sg->addr = 0;
+	sg->length = cpu_to_le32(req->data_len);
+	sg->type = (NVME_TRANSPORT_SGL_DATA_DESC << 4) |
+			NVME_SGL_FMT_TRANSPORT_A;
+}
+
+static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
+{
+	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(arg);
+	struct nvme_tcp_queue *queue = &ctrl->queues[0];
+	struct nvme_tcp_cmd_pdu *pdu = ctrl->async_req.pdu;
+	struct nvme_command *cmd = &pdu->cmd;
+	u8 hdgst = nvme_tcp_hdgst_len(queue);
+
+	memset(pdu, 0, sizeof(*pdu));
+	pdu->hdr.type = nvme_tcp_cmd;
+	if (queue->hdr_digest)
+		pdu->hdr.flags |= NVME_TCP_F_HDGST;
+	pdu->hdr.hlen = sizeof(*pdu);
+	pdu->hdr.plen = cpu_to_le32(pdu->hdr.hlen + hdgst);
+
+	cmd->common.opcode = nvme_admin_async_event;
+	cmd->common.command_id = NVME_AQ_BLK_MQ_DEPTH;
+	cmd->common.flags |= NVME_CMD_SGL_METABUF;
+	nvme_tcp_set_sg_null(cmd);
+
+	ctrl->async_req.snd.state = NVME_TCP_SEND_CMD_PDU;
+	ctrl->async_req.snd.offset = 0;
+	ctrl->async_req.snd.curr_bio = NULL;
+	ctrl->async_req.rcv.curr_bio = NULL;
+	ctrl->async_req.data_len = 0;
+
+	nvme_tcp_queue_request(&ctrl->async_req);
+}
+
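+/*
+ * If the controller is not live, complete the timed out request right away;
+ * otherwise trigger error recovery and let the block layer reset the timer.
+ */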
+static enum blk_eh_timer_return
+nvme_tcp_timeout(struct request *rq, bool reserved)
+{
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	struct nvme_tcp_ctrl *ctrl = req->queue->ctrl;
+	struct nvme_tcp_cmd_pdu *pdu = req->pdu;
+
+	dev_dbg(ctrl->ctrl.device,
+		"queue %d: timeout request %#x type %d\n",
+		nvme_tcp_queue_id(req->queue), rq->tag,
+		pdu->hdr.type);
+
+	if (ctrl->ctrl.state != NVME_CTRL_LIVE) {
+		union nvme_result res = {};
+
+		nvme_req(rq)->flags |= NVME_REQ_CANCELLED;
+		nvme_end_request(rq, NVME_SC_ABORT_REQ, res);
+		return BLK_EH_DONE;
+	}
+
+	/* queue error recovery */
+	nvme_tcp_error_recovery(&ctrl->ctrl);
+
+	return BLK_EH_RESET_TIMER;
+}
+
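+/*
+ * Map the command data: small writes that fit within the in-capsule data
+ * size are sent inline, everything else is described as transport SGL
+ * host data and transferred via R2T/C2H data PDUs.
+ */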
+static blk_status_t nvme_tcp_map_data(struct nvme_tcp_queue *queue,
+			struct request *rq)
+{
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	struct nvme_tcp_cmd_pdu *pdu = req->pdu;
+	struct nvme_command *c = &pdu->cmd;
+
+	c->common.flags |= NVME_CMD_SGL_METABUF;
+
+	if (!req->data_len) {
+		nvme_tcp_set_sg_null(c);
+		return 0;
+	}
+
+	if (rq_data_dir(rq) == WRITE &&
+	    req->data_len <= nvme_tcp_inline_data_size(queue))
+		nvme_tcp_set_sg_inline(queue, req, c);
+	else
+		nvme_tcp_set_sg_host_data(req, c);
+
+	return 0;
+}
+
+static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
+		struct request *rq)
+{
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	struct nvme_tcp_cmd_pdu *pdu = req->pdu;
+	struct nvme_tcp_queue *queue = req->queue;
+	u8 hdgst = nvme_tcp_hdgst_len(queue), ddgst = 0;
+	blk_status_t ret;
+
+	ret = nvme_setup_cmd(ns, rq, &pdu->cmd);
+	if (ret)
+		return ret;
+
+	req->snd.state = NVME_TCP_SEND_CMD_PDU;
+	req->snd.offset = 0;
+	req->snd.data_sent = 0;
+	req->pdu_len = 0;
+	req->pdu_sent = 0;
+	req->data_len = blk_rq_payload_bytes(rq);
+
+	if (rq_data_dir(rq) == WRITE) {
+		req->snd.curr_bio = rq->bio;
+		if (req->data_len <= nvme_tcp_inline_data_size(queue))
+			req->pdu_len = req->data_len;
+	} else {
+		req->rcv.curr_bio = rq->bio;
+		if (req->rcv.curr_bio)
+			nvme_tcp_init_recv_iter(req);
+	}
+
+	pdu->hdr.type = nvme_tcp_cmd;
+	pdu->hdr.flags = 0;
+	if (queue->hdr_digest)
+		pdu->hdr.flags |= NVME_TCP_F_HDGST;
+	if (queue->data_digest && req->pdu_len) {
+		pdu->hdr.flags |= NVME_TCP_F_DDGST;
+		ddgst = nvme_tcp_ddgst_len(queue);
+	}
+	pdu->hdr.hlen = sizeof(*pdu);
+	pdu->hdr.pdo = req->pdu_len ? pdu->hdr.hlen + hdgst : 0;
+	pdu->hdr.plen =
+		cpu_to_le32(pdu->hdr.hlen + hdgst + req->pdu_len + ddgst);
+
+	ret = nvme_tcp_map_data(queue, rq);
+	if (unlikely(ret)) {
+		dev_err(queue->ctrl->ctrl.device,
+			"Failed to map data (%d)\n", ret);
+		return ret;
+	}
+
+	return 0;
+}
+
+static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
+		const struct blk_mq_queue_data *bd)
+{
+	struct nvme_ns *ns = hctx->queue->queuedata;
+	struct nvme_tcp_queue *queue = hctx->driver_data;
+	struct request *rq = bd->rq;
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	bool queue_ready = test_bit(NVME_TCP_Q_LIVE, &queue->flags);
+	blk_status_t ret;
+
+	if (!nvmf_check_ready(&queue->ctrl->ctrl, rq, queue_ready))
+		return nvmf_fail_nonready_command(&queue->ctrl->ctrl, rq);
+
+	ret = nvme_tcp_setup_cmd_pdu(ns, rq);
+	if (unlikely(ret))
+		return ret;
+
+	blk_mq_start_request(rq);
+
+	nvme_tcp_queue_request(req);
+
+	return BLK_STS_OK;
+}
+
+static struct blk_mq_ops nvme_tcp_mq_ops = {
+	.queue_rq	= nvme_tcp_queue_rq,
+	.complete	= nvme_complete_rq,
+	.init_request	= nvme_tcp_init_request,
+	.exit_request	= nvme_tcp_exit_request,
+	.init_hctx	= nvme_tcp_init_hctx,
+	.timeout	= nvme_tcp_timeout,
+};
+
+static struct blk_mq_ops nvme_tcp_admin_mq_ops = {
+	.queue_rq	= nvme_tcp_queue_rq,
+	.complete	= nvme_complete_rq,
+	.init_request	= nvme_tcp_init_request,
+	.exit_request	= nvme_tcp_exit_request,
+	.init_hctx	= nvme_tcp_init_admin_hctx,
+	.timeout	= nvme_tcp_timeout,
+};
+
+static const struct nvme_ctrl_ops nvme_tcp_ctrl_ops = {
+	.name			= "tcp",
+	.module			= THIS_MODULE,
+	.flags			= NVME_F_FABRICS,
+	.reg_read32		= nvmf_reg_read32,
+	.reg_read64		= nvmf_reg_read64,
+	.reg_write32		= nvmf_reg_write32,
+	.free_ctrl		= nvme_tcp_free_ctrl,
+	.submit_async_event	= nvme_tcp_submit_async_event,
+	.delete_ctrl		= nvme_tcp_delete_ctrl,
+	.get_address		= nvmf_get_address,
+	.stop_ctrl		= nvme_tcp_stop_ctrl,
+};
+
+static bool
+nvme_tcp_existing_controller(struct nvmf_ctrl_options *opts)
+{
+	struct nvme_tcp_ctrl *ctrl;
+	bool found = false;
+
+	mutex_lock(&nvme_tcp_ctrl_mutex);
+	list_for_each_entry(ctrl, &nvme_tcp_ctrl_list, list) {
+		found = nvmf_ip_options_match(&ctrl->ctrl, opts);
+		if (found)
+			break;
+	}
+	mutex_unlock(&nvme_tcp_ctrl_mutex);
+
+	return found;
+}
+
+static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
+		struct nvmf_ctrl_options *opts)
+{
+	struct nvme_tcp_ctrl *ctrl;
+	int ret;
+
+	ctrl = kzalloc(sizeof(*ctrl), GFP_KERNEL);
+	if (!ctrl)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&ctrl->list);
+	ctrl->ctrl.opts = opts;
+	ctrl->ctrl.queue_count = opts->nr_io_queues + 1; /* +1 for admin queue */
+	ctrl->ctrl.sqsize = opts->queue_size - 1;
+	ctrl->ctrl.kato = opts->kato;
+
+	INIT_DELAYED_WORK(&ctrl->ctrl.connect_work,
+			nvme_tcp_reconnect_ctrl_work);
+	INIT_WORK(&ctrl->ctrl.err_work, nvme_tcp_error_recovery_work);
+	INIT_WORK(&ctrl->ctrl.reset_work, nvme_reset_ctrl_work);
+
+	if (!(opts->mask & NVMF_OPT_TRSVCID)) {
+		opts->trsvcid =
+			kstrdup(__stringify(NVME_TCP_DISC_PORT), GFP_KERNEL);
+		if (!opts->trsvcid) {
+			ret = -ENOMEM;
+			goto out_free_ctrl;
+		}
+		opts->mask |= NVMF_OPT_TRSVCID;
+	}
+
+	ret = inet_pton_with_scope(&init_net, AF_UNSPEC,
+			opts->traddr, opts->trsvcid, &ctrl->addr);
+	if (ret) {
+		pr_err("malformed address passed: %s:%s\n",
+			opts->traddr, opts->trsvcid);
+		goto out_free_ctrl;
+	}
+
+	if (opts->mask & NVMF_OPT_HOST_TRADDR) {
+		ret = inet_pton_with_scope(&init_net, AF_UNSPEC,
+			opts->host_traddr, NULL, &ctrl->src_addr);
+		if (ret) {
+			pr_err("malformed src address passed: %s\n",
+			       opts->host_traddr);
+			goto out_free_ctrl;
+		}
+	}
+
+	if (!opts->duplicate_connect && nvme_tcp_existing_controller(opts)) {
+		ret = -EALREADY;
+		goto out_free_ctrl;
+	}
+
+	ctrl->queues = kcalloc(opts->nr_io_queues + 1, sizeof(*ctrl->queues),
+				GFP_KERNEL);
+	if (!ctrl->queues) {
+		ret = -ENOMEM;
+		goto out_free_ctrl;
+	}
+
+	ret = nvme_init_ctrl(&ctrl->ctrl, dev, &nvme_tcp_ctrl_ops, 0);
+	if (ret)
+		goto out_kfree_queues;
+
+	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
+		WARN_ON_ONCE(1);
+		ret = -EINTR;
+		goto out_uninit_ctrl;
+	}
+
+	ret = nvme_tcp_setup_ctrl(&ctrl->ctrl, true);
+	if (ret)
+		goto out_uninit_ctrl;
+
+	dev_info(ctrl->ctrl.device, "new ctrl: NQN \"%s\", addr %pISp\n",
+		ctrl->ctrl.opts->subsysnqn, &ctrl->addr);
+
+	nvme_get_ctrl(&ctrl->ctrl);
+
+	mutex_lock(&nvme_tcp_ctrl_mutex);
+	list_add_tail(&ctrl->list, &nvme_tcp_ctrl_list);
+	mutex_unlock(&nvme_tcp_ctrl_mutex);
+
+	return &ctrl->ctrl;
+
+out_uninit_ctrl:
+	nvme_uninit_ctrl(&ctrl->ctrl);
+	nvme_put_ctrl(&ctrl->ctrl);
+	if (ret > 0)
+		ret = -EIO;
+	return ERR_PTR(ret);
+out_kfree_queues:
+	kfree(ctrl->queues);
+out_free_ctrl:
+	kfree(ctrl);
+	return ERR_PTR(ret);
+}
+
+static struct nvmf_transport_ops nvme_tcp_transport = {
+	.name		= "tcp",
+	.module		= THIS_MODULE,
+	.required_opts	= NVMF_OPT_TRADDR,
+	.allowed_opts	= NVMF_OPT_TRSVCID | NVMF_OPT_RECONNECT_DELAY |
+			  NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO |
+			  NVMF_OPT_HDR_DIGEST | NVMF_OPT_DATA_DIGEST,
+	.create_ctrl	= nvme_tcp_create_ctrl,
+};
+
+static int __init nvme_tcp_init_module(void)
+{
+	nvme_tcp_wq = alloc_workqueue("nvme_tcp_wq",
+			WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
+	if (!nvme_tcp_wq)
+		return -ENOMEM;
+
+	nvmf_register_transport(&nvme_tcp_transport);
+	return 0;
+}
+
+static void __exit nvme_tcp_cleanup_module(void)
+{
+	struct nvme_tcp_ctrl *ctrl;
+
+	nvmf_unregister_transport(&nvme_tcp_transport);
+
+	mutex_lock(&nvme_tcp_ctrl_mutex);
+	list_for_each_entry(ctrl, &nvme_tcp_ctrl_list, list)
+		nvme_delete_ctrl(&ctrl->ctrl);
+	mutex_unlock(&nvme_tcp_ctrl_mutex);
+	flush_workqueue(nvme_delete_wq);
+
+	destroy_workqueue(nvme_tcp_wq);
+}
+
+module_init(nvme_tcp_init_module);
+module_exit(nvme_tcp_cleanup_module);
+
+MODULE_LICENSE("GPL v2");
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH nvme-cli v2 15/14] nvme: Add TCP transport
  2018-11-20  3:00 ` Sagi Grimberg
@ 2018-11-20  3:00   ` Sagi Grimberg
  -1 siblings, 0 replies; 76+ messages in thread
From: Sagi Grimberg @ 2018-11-20  3:00 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, netdev, David S. Miller, Keith Busch, Christoph Hellwig

From: Sagi Grimberg <sagi@lightbitslabs.com>

Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
---
 fabrics.c    | 5 ++++-
 linux/nvme.h | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/fabrics.c b/fabrics.c
index be6a18cd4787..b4fac97253ba 100644
--- a/fabrics.c
+++ b/fabrics.c
@@ -96,6 +96,7 @@ static const char *arg_str(const char * const *strings,
 static const char * const trtypes[] = {
 	[NVMF_TRTYPE_RDMA]	= "rdma",
 	[NVMF_TRTYPE_FC]	= "fibre-channel",
+	[NVMF_TRTYPE_TCP]	= "tcp",
 	[NVMF_TRTYPE_LOOP]	= "loop",
 };
 
@@ -703,11 +704,13 @@ retry:
 		/* we can safely ignore the rest of the entries */
 		break;
 	case NVMF_TRTYPE_RDMA:
+	case NVMF_TRTYPE_TCP:
 		switch (e->adrfam) {
 		case NVMF_ADDR_FAMILY_IP4:
 		case NVMF_ADDR_FAMILY_IP6:
 			/* FALLTHRU */
-			len = sprintf(p, ",transport=rdma");
+			len = sprintf(p, ",transport=%s",
+				e->trtype == NVMF_TRTYPE_RDMA ? "rdma" : "tcp");
 			if (len < 0)
 				return -EINVAL;
 			p += len;
diff --git a/linux/nvme.h b/linux/nvme.h
index a6a44b066267..7a600c791877 100644
--- a/linux/nvme.h
+++ b/linux/nvme.h
@@ -52,6 +52,7 @@ enum {
 enum {
 	NVMF_TRTYPE_RDMA	= 1,	/* RDMA */
 	NVMF_TRTYPE_FC		= 2,	/* Fibre Channel */
+	NVMF_TRTYPE_TCP		= 3,	/* TCP */
 	NVMF_TRTYPE_LOOP	= 254,	/* Reserved for host usage */
 	NVMF_TRTYPE_MAX,
 };
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH nvme-cli v2 16/14] fabrics: add tcp port tsas decoding
  2018-11-20  3:00 ` Sagi Grimberg
@ 2018-11-20  3:00   ` Sagi Grimberg
  -1 siblings, 0 replies; 76+ messages in thread
From: Sagi Grimberg @ 2018-11-20  3:00 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, netdev, David S. Miller, Keith Busch, Christoph Hellwig

From: Sagi Grimberg <sagi@lightbitslabs.com>

The tcp tsas includes a sectype indication for unsecured/tls ports.
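
Discovery output for a tcp port then includes a line of the form
(illustrative):

        sectype: tls

("none" is printed for unsecured ports).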

Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
---
 fabrics.c    | 14 ++++++++++++++
 linux/nvme.h | 10 ++++++++++
 2 files changed, 24 insertions(+)

diff --git a/fabrics.c b/fabrics.c
index b4fac97253ba..3df273894632 100644
--- a/fabrics.c
+++ b/fabrics.c
@@ -141,6 +141,16 @@ static inline const char *treq_str(__u8 treq)
 	return arg_str(treqs, ARRAY_SIZE(treqs), treq);
 }
 
+static const char * const sectypes[] = {
+	[NVMF_TCP_SECTYPE_NONE]		= "none",
+	[NVMF_TCP_SECTYPE_TLS]		= "tls",
+};
+
+static inline const char *sectype_str(__u8 sectype)
+{
+	return arg_str(sectypes, ARRAY_SIZE(sectypes), sectype);
+}
+
 static const char * const prtypes[] = {
 	[NVMF_RDMA_PRTYPE_NOT_SPECIFIED]	= "not specified",
 	[NVMF_RDMA_PRTYPE_IB]			= "infiniband",
@@ -450,6 +460,10 @@ static void print_discovery_log(struct nvmf_disc_rsp_page_hdr *log, int numrec)
 			printf("rdma_pkey: 0x%04x\n",
 				e->tsas.rdma.pkey);
 			break;
+		case NVMF_TRTYPE_TCP:
+			printf("sectype: %s\n",
+				sectype_str(e->tsas.tcp.sectype));
+			break;
 		}
 	}
 }
diff --git a/linux/nvme.h b/linux/nvme.h
index 7a600c791877..68000eb8c1dc 100644
--- a/linux/nvme.h
+++ b/linux/nvme.h
@@ -91,6 +91,13 @@ enum {
 	NVMF_RDMA_CMS_RDMA_CM	= 1, /* Sockets based endpoint addressing */
 };
 
+/* TCP port security type for Discovery Log Page entry TSAS
+ */
+enum {
+	NVMF_TCP_SECTYPE_NONE	= 0, /* No Security */
+	NVMF_TCP_SECTYPE_TLS	= 1, /* Transport Layer Security */
+};
+
 #define NVME_AQ_DEPTH		32
 #define NVME_NR_AEN_COMMANDS	1
 #define NVME_AQ_BLK_MQ_DEPTH	(NVME_AQ_DEPTH - NVME_NR_AEN_COMMANDS)
@@ -1184,6 +1191,9 @@ struct nvmf_disc_rsp_page_entry {
 			__u16	pkey;
 			__u8	resv10[246];
 		} rdma;
+		struct tcp {
+			__u8	sectype;
+		} tcp;
 	} tsas;
 };
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH nvme-cli v2 17/14] fabrics: add transport header and data digest
  2018-11-20  3:00 ` Sagi Grimberg
@ 2018-11-20  3:00   ` Sagi Grimberg
  -1 siblings, 0 replies; 76+ messages in thread
From: Sagi Grimberg @ 2018-11-20  3:00 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, netdev, David S. Miller, Keith Busch, Christoph Hellwig

From: Sagi Grimberg <sagi@lightbitslabs.com>

This setting enables transport header and data digests for NVMe/TCP
controllers.
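
For example (illustrative, using the short options added below):

        nvme connect -t tcp -a 10.0.0.2 -s 4420 -n <subsys-nqn> -g -G

which appends hdr_digest and data_digest to the fabrics options string.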

Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
---
 fabrics.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/fabrics.c b/fabrics.c
index 3df273894632..1995f15c5e7b 100644
--- a/fabrics.c
+++ b/fabrics.c
@@ -62,6 +62,8 @@ static struct config {
 	int  duplicate_connect;
 	int  disable_sqflow;
 	bool persistent;
+	int  hdr_digest;
+	int  data_digest;
 } cfg = { NULL };
 
 #define BUF_SIZE		4096
@@ -629,7 +631,9 @@ static int build_options(char *argstr, int max_len)
 	    add_bool_argument(&argstr, &max_len, "duplicate_connect",
 				cfg.duplicate_connect) ||
 	    add_bool_argument(&argstr, &max_len, "disable_sqflow",
-				cfg.disable_sqflow))
+				cfg.disable_sqflow) ||
+	    add_bool_argument(&argstr, &max_len, "hdr_digest", cfg.hdr_digest) ||
+	    add_bool_argument(&argstr, &max_len, "data_digest", cfg.data_digest))
 		return -EINVAL;
 
 	return 0;
@@ -710,6 +714,20 @@ retry:
 		p += len;
 	}
 
+	if (cfg.hdr_digest) {
+		len = sprintf(p, ",hdr_digest");
+		if (len < 0)
+			return -EINVAL;
+		p += len;
+	}
+
+	if (cfg.data_digest) {
+		len = sprintf(p, ",data_digest");
+		if (len < 0)
+			return -EINVAL;
+		p += len;
+	}
+
 	switch (e->trtype) {
 	case NVMF_TRTYPE_LOOP: /* loop */
 		len = sprintf(p, ",transport=loop");
@@ -957,6 +975,8 @@ int discover(const char *desc, int argc, char **argv, bool connect)
 		{"reconnect-delay", 'c', "LIST", CFG_INT, &cfg.reconnect_delay, required_argument, "reconnect timeout period in seconds" },
 		{"ctrl-loss-tmo",   'l', "LIST", CFG_INT, &cfg.ctrl_loss_tmo,   required_argument, "controller loss timeout period in seconds" },
 		{"persistent",  'p', "LIST", CFG_NONE, &cfg.persistent,  no_argument, "persistent discovery connection" },
+		{"hdr_digest", 'g', "", CFG_NONE, &cfg.hdr_digest, no_argument, "enable transport protocol header digest (TCP transport)" },
+		{"data_digest", 'G', "", CFG_NONE, &cfg.data_digest, no_argument, "enable transport protocol data digest (TCP transport)" },
 		{NULL},
 	};
 
@@ -1000,6 +1020,8 @@ int connect(const char *desc, int argc, char **argv)
 		{"ctrl-loss-tmo",   'l', "LIST", CFG_INT, &cfg.ctrl_loss_tmo,   required_argument, "controller loss timeout period in seconds" },
 		{"duplicate_connect", 'D', "", CFG_NONE, &cfg.duplicate_connect, no_argument, "allow duplicate connections between same transport host and subsystem port" },
 		{"disable_sqflow", 'd', "", CFG_NONE, &cfg.disable_sqflow, no_argument, "disable controller sq flow control (default false)" },
+		{"hdr_digest", 'g', "", CFG_NONE, &cfg.hdr_digest, no_argument, "enable transport protocol header digest (TCP transport)" },
+		{"data_digest", 'G', "", CFG_NONE, &cfg.data_digest, no_argument, "enable transport protocol data digest (TCP transport)" },
 		{NULL},
 	};
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH nvme-cli v2 15/14] nvme: Add TCP transport
  2018-11-20  3:00   ` Sagi Grimberg
@ 2018-11-20  9:36     ` Arend van Spriel
  -1 siblings, 0 replies; 76+ messages in thread
From: Arend van Spriel @ 2018-11-20  9:36 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme
  Cc: linux-block, netdev, David S. Miller, Keith Busch, Christoph Hellwig

On 11/20/2018 4:00 AM, Sagi Grimberg wrote:
> From: Sagi Grimberg <sagi@lightbitslabs.com>
>
> Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
> ---
>  fabrics.c    | 5 ++++-
>  linux/nvme.h | 1 +
>  2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/fabrics.c b/fabrics.c
> index be6a18cd4787..b4fac97253ba 100644
> --- a/fabrics.c
> +++ b/fabrics.c
> @@ -96,6 +96,7 @@ static const char *arg_str(const char * const *strings,
>  static const char * const trtypes[] = {
>  	[NVMF_TRTYPE_RDMA]	= "rdma",
>  	[NVMF_TRTYPE_FC]	= "fibre-channel",
> +	[NVMF_TRTYPE_TCP]	= "tcp",
>  	[NVMF_TRTYPE_LOOP]	= "loop",
>  };
>
> @@ -703,11 +704,13 @@ retry:
>  		/* we can safely ignore the rest of the entries */
>  		break;
>  	case NVMF_TRTYPE_RDMA:
> +	case NVMF_TRTYPE_TCP:
>  		switch (e->adrfam) {
>  		case NVMF_ADDR_FAMILY_IP4:
>  		case NVMF_ADDR_FAMILY_IP6:
>  			/* FALLTHRU */
> -			len = sprintf(p, ",transport=rdma");
> +			len = sprintf(p, ",transport=%s",
> +				e->trtype == NVMF_TRTYPE_RDMA ? "rdma" : "tcp");

So why not just use the trtypes array above?

Regards,
Arend

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH nvme-cli v2 15/14] nvme: Add TCP transport
  2018-11-20  9:36     ` Arend van Spriel
@ 2018-11-20 22:56       ` Sagi Grimberg
  -1 siblings, 0 replies; 76+ messages in thread
From: Sagi Grimberg @ 2018-11-20 22:56 UTC (permalink / raw)
  To: Arend van Spriel, linux-nvme
  Cc: linux-block, netdev, David S. Miller, Keith Busch, Christoph Hellwig


>> @@ -703,11 +704,13 @@ retry:
>>          /* we can safely ignore the rest of the entries */
>>          break;
>>      case NVMF_TRTYPE_RDMA:
>> +    case NVMF_TRTYPE_TCP:
>>          switch (e->adrfam) {
>>          case NVMF_ADDR_FAMILY_IP4:
>>          case NVMF_ADDR_FAMILY_IP6:
>>              /* FALLTHRU */
>> -            len = sprintf(p, ",transport=rdma");
>> +            len = sprintf(p, ",transport=%s",
>> +                e->trtype == NVMF_TRTYPE_RDMA ? "rdma" : "tcp");
> 
> So why not just use the trtypes array above?

We can use that...
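
i.e. something like (untested):

	len = sprintf(p, ",transport=%s",
		arg_str(trtypes, ARRAY_SIZE(trtypes), e->trtype));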

^ permalink raw reply	[flat|nested] 76+ messages in thread

* RE: [PATCH v2 14/14] nvme-tcp: add NVMe over TCP host driver
  2018-11-20  3:00   ` Sagi Grimberg
  (?)
@ 2018-11-20 23:34     ` Narayan Ayalasomayajula
  -1 siblings, 0 replies; 76+ messages in thread
From: Narayan Ayalasomayajula @ 2018-11-20 23:34 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme
  Cc: linux-block, netdev, Keith Busch, David S. Miller, Christoph Hellwig

Hi Sagi,

>+       icreq->pfv = cpu_to_le16(NVME_TCP_PFV_1_0);
>+       icreq->maxr2t = cpu_to_le16(1); /* single inflight r2t supported */
>+       icreq->hpda = 0; /* no alignment constraint */

The NVMe-TCP spec indicates that MAXR2T is a 0's-based value. To support a single inflight R2T as indicated in the comment above, icreq->maxr2t should be set to 0, right? 
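i.e., presumably something like:

	icreq->maxr2t = cpu_to_le16(0); /* 0's based: single inflight r2t */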

Thanks,
Narayan

-----Original Message-----
From: Linux-nvme <linux-nvme-bounces@lists.infradead.org> On Behalf Of Sagi Grimberg
Sent: Monday, November 19, 2018 7:00 PM
To: linux-nvme@lists.infradead.org
Cc: linux-block@vger.kernel.org; netdev@vger.kernel.org; Keith Busch <keith.busch@intel.com>; David S. Miller <davem@davemloft.net>; Christoph Hellwig <hch@lst.de>
Subject: [PATCH v2 14/14] nvme-tcp: add NVMe over TCP host driver


From: Sagi Grimberg <sagi@lightbitslabs.com>

This patch implements the NVMe over TCP host driver. It can be used to
connect to remote NVMe over Fabrics subsystems over good old TCP/IP.

The driver implements the TP 8000 of how nvme over fabrics capsules and
data are encapsulated in nvme-tcp pdus and exchanged on top of a TCP byte
stream. nvme-tcp header and data digest are supported as well.

To connect to all NVMe over Fabrics controllers reachable on a given target
port over TCP use the following command:

        nvme connect-all -t tcp -a $IPADDR

This requires the latest version of nvme-cli with TCP support.

Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
Signed-off-by: Solganik Alexander <sashas@lightbitslabs.com>
---
 drivers/nvme/host/Kconfig  |   15 +
 drivers/nvme/host/Makefile |    3 +
 drivers/nvme/host/tcp.c    | 2306 ++++++++++++++++++++++++++++++++++++
 3 files changed, 2324 insertions(+)
 create mode 100644 drivers/nvme/host/tcp.c

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index 88a8b5916624..0f345e207675 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -57,3 +57,18 @@ config NVME_FC
          from https://github.com/linux-nvme/nvme-cli.

          If unsure, say N.
+
+config NVME_TCP
+       tristate "NVM Express over Fabrics TCP host driver"
+       depends on INET
+       depends on BLK_DEV_NVME
+       select NVME_FABRICS
+       help
+         This provides support for the NVMe over Fabrics protocol using
+         the TCP transport.  This allows you to use remote block devices
+         exported using the NVMe protocol set.
+
+         To configure a NVMe over Fabrics controller use the nvme-cli tool
+         from https://github.com/linux-nvme/nvme-cli.
+
+         If unsure, say N.
diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
index aea459c65ae1..8a4b671c5f0c 100644
--- a/drivers/nvme/host/Makefile
+++ b/drivers/nvme/host/Makefile
@@ -7,6 +7,7 @@ obj-$(CONFIG_BLK_DEV_NVME)              += nvme.o
 obj-$(CONFIG_NVME_FABRICS)             += nvme-fabrics.o
 obj-$(CONFIG_NVME_RDMA)                        += nvme-rdma.o
 obj-$(CONFIG_NVME_FC)                  += nvme-fc.o
+obj-$(CONFIG_NVME_TCP)                 += nvme-tcp.o

 nvme-core-y                            := core.o
 nvme-core-$(CONFIG_TRACING)            += trace.o
@@ -21,3 +22,5 @@ nvme-fabrics-y                                += fabrics.o
 nvme-rdma-y                            += rdma.o

 nvme-fc-y                              += fc.o
+
+nvme-tcp-y                             += tcp.o
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
new file mode 100644
index 000000000000..4c583859f0ad
--- /dev/null
+++ b/drivers/nvme/host/tcp.c
@@ -0,0 +1,2306 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * NVMe over Fabrics TCP host.
+ * Copyright (c) 2018 LightBits Labs. All rights reserved.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/nvme-tcp.h>
+#include <net/sock.h>
+#include <net/tcp.h>
+#include <linux/blk-mq.h>
+#include <crypto/hash.h>
+
+#include "nvme.h"
+#include "fabrics.h"
+
+struct nvme_tcp_queue;
+
+enum nvme_tcp_send_state {
+       NVME_TCP_SEND_CMD_PDU = 0,
+       NVME_TCP_SEND_H2C_PDU,
+       NVME_TCP_SEND_DATA,
+       NVME_TCP_SEND_DDGST,
+};
+
+struct nvme_tcp_send_ctx {
+       struct bio              *curr_bio;
+       struct iov_iter         iter;
+       size_t                  offset;
+       size_t                  data_sent;
+       enum nvme_tcp_send_state state;
+};
+
+struct nvme_tcp_recv_ctx {
+       struct iov_iter         iter;
+       struct bio              *curr_bio;
+};
+
+struct nvme_tcp_request {
+       struct nvme_request     req;
+       void                    *pdu;
+       struct nvme_tcp_queue   *queue;
+       u32                     data_len;
+       u32                     pdu_len;
+       u32                     pdu_sent;
+       u16                     ttag;
+       struct list_head        entry;
+       struct nvme_tcp_recv_ctx rcv;
+       struct nvme_tcp_send_ctx snd;
+       u32                     ddgst;
+};
+
+enum nvme_tcp_queue_flags {
+       NVME_TCP_Q_ALLOCATED    = 0,
+       NVME_TCP_Q_LIVE         = 1,
+};
+
+enum nvme_tcp_recv_state {
+       NVME_TCP_RECV_PDU = 0,
+       NVME_TCP_RECV_DATA,
+       NVME_TCP_RECV_DDGST,
+};
+
+struct nvme_tcp_queue_recv_ctx {
+       char            *pdu;
+       int             pdu_remaining;
+       int             pdu_offset;
+       size_t          data_remaining;
+       size_t          ddgst_remaining;
+};
+
+struct nvme_tcp_ctrl;
+struct nvme_tcp_queue {
+       struct socket           *sock;
+       struct work_struct      io_work;
+       int                     io_cpu;
+
+       spinlock_t              lock;
+       struct list_head        send_list;
+
+       int                     queue_size;
+       size_t                  cmnd_capsule_len;
+       struct nvme_tcp_ctrl    *ctrl;
+       unsigned long           flags;
+       bool                    rd_enabled;
+
+       struct nvme_tcp_queue_recv_ctx rcv;
+       struct nvme_tcp_request *request;
+
+       bool                    hdr_digest;
+       bool                    data_digest;
+       struct ahash_request    *rcv_hash;
+       struct ahash_request    *snd_hash;
+       __le32                  exp_ddgst;
+       __le32                  recv_ddgst;
+
+       struct page_frag_cache  pf_cache;
+
+       void (*sc)(struct sock *);
+       void (*dr)(struct sock *);
+       void (*ws)(struct sock *);
+};
+
+struct nvme_tcp_ctrl {
+       /* read only in the hot path */
+       struct nvme_tcp_queue   *queues;
+       struct blk_mq_tag_set   tag_set;
+
+       /* other member variables */
+       struct list_head        list;
+       struct blk_mq_tag_set   admin_tag_set;
+       struct sockaddr_storage addr;
+       struct sockaddr_storage src_addr;
+       struct nvme_ctrl        ctrl;
+
+       struct nvme_tcp_request async_req;
+};
+
+static LIST_HEAD(nvme_tcp_ctrl_list);
+static DEFINE_MUTEX(nvme_tcp_ctrl_mutex);
+static struct workqueue_struct *nvme_tcp_wq;
+static struct blk_mq_ops nvme_tcp_mq_ops;
+static struct blk_mq_ops nvme_tcp_admin_mq_ops;
+
+static inline struct nvme_tcp_ctrl *to_tcp_ctrl(struct nvme_ctrl *ctrl)
+{
+       return container_of(ctrl, struct nvme_tcp_ctrl, ctrl);
+}
+
+static inline int nvme_tcp_queue_id(struct nvme_tcp_queue *queue)
+{
+       return queue - queue->ctrl->queues;
+}
+
+static inline struct blk_mq_tags *nvme_tcp_tagset(struct nvme_tcp_queue *queue)
+{
+       u32 queue_idx = nvme_tcp_queue_id(queue);
+
+       if (queue_idx == 0)
+               return queue->ctrl->admin_tag_set.tags[queue_idx];
+       return queue->ctrl->tag_set.tags[queue_idx - 1];
+}
+
+static inline u8 nvme_tcp_hdgst_len(struct nvme_tcp_queue *queue)
+{
+       return queue->hdr_digest ? NVME_TCP_DIGEST_LENGTH : 0;
+}
+
+static inline u8 nvme_tcp_ddgst_len(struct nvme_tcp_queue *queue)
+{
+       return queue->data_digest ? NVME_TCP_DIGEST_LENGTH : 0;
+}
+
+static inline size_t nvme_tcp_inline_data_size(struct nvme_tcp_queue *queue)
+{
+       return queue->cmnd_capsule_len - sizeof(struct nvme_command);
+}
+
+static inline bool nvme_tcp_async_req(struct nvme_tcp_request *req)
+{
+       return req == &req->queue->ctrl->async_req;
+}
+
+static inline bool nvme_tcp_has_inline_data(struct nvme_tcp_request *req)
+{
+       struct request *rq;
+       unsigned int bytes;
+
+       if (unlikely(nvme_tcp_async_req(req)))
+               return false; /* async events don't have a request */
+
+       rq = blk_mq_rq_from_pdu(req);
+       bytes = blk_rq_payload_bytes(rq);
+
+       return rq_data_dir(rq) == WRITE && bytes &&
+               bytes <= nvme_tcp_inline_data_size(req->queue);
+}
+
+static inline struct page *nvme_tcp_req_cur_page(struct nvme_tcp_request *req)
+{
+       return req->snd.iter.bvec->bv_page;
+}
+
+static inline size_t nvme_tcp_req_cur_offset(struct nvme_tcp_request *req)
+{
+       return req->snd.iter.bvec->bv_offset + req->snd.iter.iov_offset;
+}
+
+static inline size_t nvme_tcp_req_cur_length(struct nvme_tcp_request *req)
+{
+       return min_t(size_t, req->snd.iter.bvec->bv_len - req->snd.iter.iov_offset,
+                       req->pdu_len - req->pdu_sent);
+}
+
+static inline size_t nvme_tcp_req_offset(struct nvme_tcp_request *req)
+{
+       return req->snd.iter.iov_offset;
+}
+
+static inline size_t nvme_tcp_pdu_data_left(struct nvme_tcp_request *req)
+{
+       return rq_data_dir(blk_mq_rq_from_pdu(req)) == WRITE ?
+                       req->pdu_len - req->pdu_sent : 0;
+}
+
+static inline size_t nvme_tcp_pdu_last_send(struct nvme_tcp_request *req,
+               int len)
+{
+       return nvme_tcp_pdu_data_left(req) <= len;
+}
+
+static void nvme_tcp_init_send_iter(struct nvme_tcp_request *req)
+{
+       struct request *rq = blk_mq_rq_from_pdu(req);
+       struct bio_vec *vec;
+       unsigned int size;
+       int nsegs;
+       size_t offset;
+
+       if (rq->rq_flags & RQF_SPECIAL_PAYLOAD) {
+               vec = &rq->special_vec;
+               nsegs = 1;
+               size = blk_rq_payload_bytes(rq);
+               offset = 0;
+       } else {
+               struct bio *bio = req->snd.curr_bio;
+
+               vec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+               nsegs = bio_segments(bio);
+               size = bio->bi_iter.bi_size;
+               offset = bio->bi_iter.bi_bvec_done;
+       }
+
+       iov_iter_bvec(&req->snd.iter, WRITE, vec, nsegs, size);
+       req->snd.iter.iov_offset = offset;
+}
+
+static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
+               int len)
+{
+       req->snd.data_sent += len;
+       req->pdu_sent += len;
+       iov_iter_advance(&req->snd.iter, len);
+       if (!iov_iter_count(&req->snd.iter) &&
+           req->snd.data_sent < req->data_len) {
+               req->snd.curr_bio = req->snd.curr_bio->bi_next;
+               nvme_tcp_init_send_iter(req);
+       }
+}
+
+static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req)
+{
+       struct nvme_tcp_queue *queue = req->queue;
+
+       spin_lock_bh(&queue->lock);
+       list_add_tail(&req->entry, &queue->send_list);
+       spin_unlock_bh(&queue->lock);
+
+       queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+}
+
+static inline struct nvme_tcp_request *
+nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
+{
+       struct nvme_tcp_request *req;
+
+       spin_lock_bh(&queue->lock);
+       req = list_first_entry_or_null(&queue->send_list,
+                       struct nvme_tcp_request, entry);
+       if (req)
+               list_del(&req->entry);
+       spin_unlock_bh(&queue->lock);
+
+       return req;
+}
+
+static inline void nvme_tcp_ddgst_final(struct ahash_request *hash, u32 *dgst)
+{
+       ahash_request_set_crypt(hash, NULL, (u8 *)dgst, 0);
+       crypto_ahash_final(hash);
+}
+
+static inline void nvme_tcp_ddgst_update(struct ahash_request *hash,
+               struct page *page, off_t off, size_t len)
+{
+       struct scatterlist sg;
+
+       sg_init_marker(&sg, 1);
+       sg_set_page(&sg, page, len, off);
+       ahash_request_set_crypt(hash, &sg, NULL, len);
+       crypto_ahash_update(hash);
+}
+
+static inline void nvme_tcp_hdgst(struct ahash_request *hash,
+               void *pdu, size_t len)
+{
+       struct scatterlist sg;
+
+       sg_init_one(&sg, pdu, len);
+       ahash_request_set_crypt(hash, &sg, pdu + len, len);
+       crypto_ahash_digest(hash);
+}
+
+static int nvme_tcp_verify_hdgst(struct nvme_tcp_queue *queue,
+       void *pdu, size_t pdu_len)
+{
+       struct nvme_tcp_hdr *hdr = pdu;
+       __le32 recv_digest;
+       __le32 exp_digest;
+
+       if (unlikely(!(hdr->flags & NVME_TCP_F_HDGST))) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "queue %d: header digest flag is cleared\n",
+                       nvme_tcp_queue_id(queue));
+               return -EPROTO;
+       }
+
+       recv_digest = *(__le32 *)(pdu + hdr->hlen);
+       nvme_tcp_hdgst(queue->rcv_hash, pdu, pdu_len);
+       exp_digest = *(__le32 *)(pdu + hdr->hlen);
+       if (recv_digest != exp_digest) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "header digest error: recv %#x expected %#x\n",
+                       le32_to_cpu(recv_digest), le32_to_cpu(exp_digest));
+               return -EIO;
+       }
+
+       return 0;
+}
+
+static int nvme_tcp_check_ddgst(struct nvme_tcp_queue *queue, void *pdu)
+{
+       struct nvme_tcp_hdr *hdr = pdu;
+       u32 len;
+
+       len = le32_to_cpu(hdr->plen) - hdr->hlen -
+               ((hdr->flags & NVME_TCP_F_HDGST) ? nvme_tcp_hdgst_len(queue) : 0);
+
+       if (unlikely(len && !(hdr->flags & NVME_TCP_F_DDGST))) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "queue %d: data digest flag is cleared\n",
+               nvme_tcp_queue_id(queue));
+               return -EPROTO;
+       }
+       crypto_ahash_init(queue->rcv_hash);
+
+       return 0;
+}
+
+static void nvme_tcp_exit_request(struct blk_mq_tag_set *set,
+               struct request *rq, unsigned int hctx_idx)
+{
+       struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+
+       page_frag_free(req->pdu);
+}
+
+static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
+               struct request *rq, unsigned int hctx_idx,
+               unsigned int numa_node)
+{
+       struct nvme_tcp_ctrl *ctrl = set->driver_data;
+       struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+       int queue_idx = (set == &ctrl->tag_set) ? hctx_idx + 1 : 0;
+       struct nvme_tcp_queue *queue = &ctrl->queues[queue_idx];
+       u8 hdgst = nvme_tcp_hdgst_len(queue);
+
+       req->pdu = page_frag_alloc(&queue->pf_cache,
+               sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
+               GFP_KERNEL | __GFP_ZERO);
+       if (!req->pdu)
+               return -ENOMEM;
+
+       req->queue = queue;
+       nvme_req(rq)->ctrl = &ctrl->ctrl;
+
+       return 0;
+}
+
+static int nvme_tcp_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+               unsigned int hctx_idx)
+{
+       struct nvme_tcp_ctrl *ctrl = data;
+       struct nvme_tcp_queue *queue = &ctrl->queues[hctx_idx + 1];
+
+       BUG_ON(hctx_idx >= ctrl->ctrl.queue_count);
+
+       hctx->driver_data = queue;
+       return 0;
+}
+
+static int nvme_tcp_init_admin_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+               unsigned int hctx_idx)
+{
+       struct nvme_tcp_ctrl *ctrl = data;
+       struct nvme_tcp_queue *queue = &ctrl->queues[0];
+
+       BUG_ON(hctx_idx != 0);
+
+       hctx->driver_data = queue;
+       return 0;
+}
+
+static enum nvme_tcp_recv_state nvme_tcp_recv_state(struct nvme_tcp_queue *queue)
+{
+       return  (queue->rcv.pdu_remaining) ? NVME_TCP_RECV_PDU :
+               (queue->rcv.ddgst_remaining) ? NVME_TCP_RECV_DDGST :
+               NVME_TCP_RECV_DATA;
+}
+
+static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
+{
+       struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+
+       rcv->pdu_remaining = sizeof(struct nvme_tcp_rsp_pdu) +
+                               nvme_tcp_hdgst_len(queue);
+       rcv->pdu_offset = 0;
+       rcv->data_remaining = -1;
+       rcv->ddgst_remaining = 0;
+}
+
+void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
+{
+       if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+               return;
+
+       queue_work(nvme_wq, &ctrl->err_work);
+}
+
+static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
+               struct nvme_completion *cqe)
+{
+       struct request *rq;
+       struct nvme_tcp_request *req;
+
+       rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
+       if (!rq) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "queue %d tag 0x%x not found\n",
+                       nvme_tcp_queue_id(queue), cqe->command_id);
+               nvme_tcp_error_recovery(&queue->ctrl->ctrl);
+               return -EINVAL;
+       }
+       req = blk_mq_rq_to_pdu(rq);
+
+       nvme_end_request(rq, cqe->status, cqe->result);
+
+       return 0;
+}
+
+static int nvme_tcp_handle_c2h_data(struct nvme_tcp_queue *queue,
+               struct nvme_tcp_data_pdu *pdu)
+{
+       struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+       struct nvme_tcp_request *req;
+       struct request *rq;
+
+       rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
+       if (!rq) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "queue %d tag %#x not found\n",
+                       nvme_tcp_queue_id(queue), pdu->command_id);
+               return -ENOENT;
+       }
+       req = blk_mq_rq_to_pdu(rq);
+
+       if (!blk_rq_payload_bytes(rq)) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "queue %d tag %#x unexpected data\n",
+                       nvme_tcp_queue_id(queue), rq->tag);
+               return -EIO;
+       }
+
+       rcv->data_remaining = le32_to_cpu(pdu->data_length);
+       /* No support for out-of-order */
+       WARN_ON(le32_to_cpu(pdu->data_offset));
+
+       return 0;
+
+}
+
+static int nvme_tcp_handle_comp(struct nvme_tcp_queue *queue,
+               struct nvme_tcp_rsp_pdu *pdu)
+{
+       struct nvme_completion *cqe = &pdu->cqe;
+       int ret = 0;
+
+       /*
+        * AEN requests are special as they don't time out and can
+        * survive any kind of queue freeze and often don't respond to
+        * aborts.  We don't even bother to allocate a struct request
+        * for them but rather special case them here.
+        */
+       if (unlikely(nvme_tcp_queue_id(queue) == 0 &&
+           cqe->command_id >= NVME_AQ_BLK_MQ_DEPTH))
+               nvme_complete_async_event(&queue->ctrl->ctrl, cqe->status,
+                               &cqe->result);
+       else
+               ret = nvme_tcp_process_nvme_cqe(queue, cqe);
+
+       return ret;
+}
+
+static int nvme_tcp_setup_h2c_data_pdu(struct nvme_tcp_request *req,
+               struct nvme_tcp_r2t_pdu *pdu)
+{
+       struct nvme_tcp_data_pdu *data = req->pdu;
+       struct nvme_tcp_queue *queue = req->queue;
+       struct request *rq = blk_mq_rq_from_pdu(req);
+       u8 hdgst = nvme_tcp_hdgst_len(queue);
+       u8 ddgst = nvme_tcp_ddgst_len(queue);
+
+       req->pdu_len = le32_to_cpu(pdu->r2t_length);
+       req->pdu_sent = 0;
+
+       if (unlikely(req->snd.data_sent + req->pdu_len > req->data_len)) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "req %d r2t length %u exceeded data length %u (%zu sent)\n",
+                       rq->tag, req->pdu_len, req->data_len,
+                       req->snd.data_sent);
+               return -EPROTO;
+       }
+
+       if (unlikely(le32_to_cpu(pdu->r2t_offset) < req->snd.data_sent)) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "req %d unexpected r2t offset %u (expected %zu)\n",
+                       rq->tag, le32_to_cpu(pdu->r2t_offset),
+                       req->snd.data_sent);
+               return -EPROTO;
+       }
+
+       memset(data, 0, sizeof(*data));
+       data->hdr.type = nvme_tcp_h2c_data;
+       data->hdr.flags = NVME_TCP_F_DATA_LAST;
+       if (queue->hdr_digest)
+               data->hdr.flags |= NVME_TCP_F_HDGST;
+       if (queue->data_digest)
+               data->hdr.flags |= NVME_TCP_F_DDGST;
+       data->hdr.hlen = sizeof(*data);
+       data->hdr.pdo = data->hdr.hlen + hdgst;
+       data->hdr.plen =
+               cpu_to_le32(data->hdr.hlen + hdgst + req->pdu_len + ddgst);
+       data->ttag = pdu->ttag;
+       data->command_id = rq->tag;
+       data->data_offset = cpu_to_le32(req->snd.data_sent);
+       data->data_length = cpu_to_le32(req->pdu_len);
+       return 0;
+}
+
+static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
+               struct nvme_tcp_r2t_pdu *pdu)
+{
+       struct nvme_tcp_request *req;
+       struct request *rq;
+       int ret;
+
+       rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
+       if (!rq) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "queue %d tag %#x not found\n",
+                       nvme_tcp_queue_id(queue), pdu->command_id);
+               return -ENOENT;
+       }
+       req = blk_mq_rq_to_pdu(rq);
+
+       ret = nvme_tcp_setup_h2c_data_pdu(req, pdu);
+       if (unlikely(ret))
+               return ret;
+
+       req->snd.state = NVME_TCP_SEND_H2C_PDU;
+       req->snd.offset = 0;
+
+       nvme_tcp_queue_request(req);
+
+       return 0;
+}
+
+static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
+               unsigned int *offset, size_t *len)
+{
+       struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+       struct nvme_tcp_hdr *hdr;
+       size_t rcv_len = min_t(size_t, *len, rcv->pdu_remaining);
+       int ret;
+
+       ret = skb_copy_bits(skb, *offset, &rcv->pdu[rcv->pdu_offset], rcv_len);
+       if (unlikely(ret))
+               return ret;
+
+       rcv->pdu_remaining -= rcv_len;
+       rcv->pdu_offset += rcv_len;
+       *offset += rcv_len;
+       *len -= rcv_len;
+       if (queue->rcv.pdu_remaining)
+               return 0;
+
+       hdr = (void *)rcv->pdu;
+       if (queue->hdr_digest) {
+               ret = nvme_tcp_verify_hdgst(queue, rcv->pdu, hdr->hlen);
+               if (unlikely(ret))
+                       return ret;
+       }
+
+       if (queue->data_digest) {
+               ret = nvme_tcp_check_ddgst(queue, rcv->pdu);
+               if (unlikely(ret))
+                       return ret;
+       }
+
+       switch (hdr->type) {
+       case nvme_tcp_c2h_data:
+               ret = nvme_tcp_handle_c2h_data(queue, (void *)rcv->pdu);
+               break;
+       case nvme_tcp_rsp:
+               nvme_tcp_init_recv_ctx(queue);
+               ret = nvme_tcp_handle_comp(queue, (void *)rcv->pdu);
+               break;
+       case nvme_tcp_r2t:
+               nvme_tcp_init_recv_ctx(queue);
+               ret = nvme_tcp_handle_r2t(queue, (void *)rcv->pdu);
+               break;
+       default:
+               dev_err(queue->ctrl->ctrl.device, "unsupported pdu type (%d)\n",
+                       hdr->type);
+               return -EINVAL;
+       }
+
+       return ret;
+}
+
+static void nvme_tcp_init_recv_iter(struct nvme_tcp_request *req)
+{
+       struct bio *bio = req->rcv.curr_bio;
+       struct bio_vec *vec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+       unsigned int nsegs = bio_segments(bio);
+
+       iov_iter_bvec(&req->rcv.iter, READ, vec, nsegs,
+               bio->bi_iter.bi_size);
+       req->rcv.iter.iov_offset = bio->bi_iter.bi_bvec_done;
+}
+
+static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
+                             unsigned int *offset, size_t *len)
+{
+       struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+       struct nvme_tcp_data_pdu *pdu = (void *)rcv->pdu;
+       struct nvme_tcp_request *req;
+       struct request *rq;
+
+       rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
+       if (!rq) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "queue %d tag %#x not found\n",
+                       nvme_tcp_queue_id(queue), pdu->command_id);
+               return -ENOENT;
+       }
+       req = blk_mq_rq_to_pdu(rq);
+
+       while (true) {
+               int recv_len, ret;
+
+               recv_len = min_t(size_t, *len, rcv->data_remaining);
+               if (!recv_len)
+                       break;
+
+               /*
+                * FIXME: This assumes that data comes in-order,
+                *  need to handle the out-of-order case.
+                */
+               if (!iov_iter_count(&req->rcv.iter)) {
+                       req->rcv.curr_bio = req->rcv.curr_bio->bi_next;
+
+                       /*
+                        * If we don't have any bios it means that the
+                        * controller sent more data than we requested, hence error
+                        */
+                       if (!req->rcv.curr_bio) {
+                               dev_err(queue->ctrl->ctrl.device,
+                                       "queue %d no space in request %#x",
+                                       nvme_tcp_queue_id(queue), rq->tag);
+                               nvme_tcp_init_recv_ctx(queue);
+                               return -EIO;
+                       }
+                       nvme_tcp_init_recv_iter(req);
+               }
+
+               /* we can read only from what is left in this bio */
+               recv_len = min_t(size_t, recv_len,
+                               iov_iter_count(&req->rcv.iter));
+
+               if (queue->data_digest)
+                       ret = skb_copy_and_hash_datagram_iter(skb, *offset,
+                               &req->rcv.iter, recv_len, queue->rcv_hash);
+               else
+                       ret = skb_copy_datagram_iter(skb, *offset,
+                                       &req->rcv.iter, recv_len);
+               if (ret) {
+                       dev_err(queue->ctrl->ctrl.device,
+                               "queue %d failed to copy request %#x data",
+                               nvme_tcp_queue_id(queue), rq->tag);
+                       return ret;
+               }
+
+               *len -= recv_len;
+               *offset += recv_len;
+               rcv->data_remaining -= recv_len;
+       }
+
+       if (!rcv->data_remaining) {
+               if (queue->data_digest) {
+                       nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
+                       rcv->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
+               } else {
+                       nvme_tcp_init_recv_ctx(queue);
+               }
+       }
+
+       return 0;
+}
+
+static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
+               struct sk_buff *skb, unsigned int *offset, size_t *len)
+{
+       struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+       char *ddgst = (char *)&queue->recv_ddgst;
+       size_t recv_len = min_t(size_t, *len, rcv->ddgst_remaining);
+       off_t off = NVME_TCP_DIGEST_LENGTH - rcv->ddgst_remaining;
+       int ret;
+
+       ret = skb_copy_bits(skb, *offset, &ddgst[off], recv_len);
+       if (unlikely(ret))
+               return ret;
+
+       rcv->ddgst_remaining -= recv_len;
+       *offset += recv_len;
+       *len -= recv_len;
+       if (rcv->ddgst_remaining)
+               return 0;
+
+       if (queue->recv_ddgst != queue->exp_ddgst) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "data digest error: recv %#x expected %#x\n",
+                       le32_to_cpu(queue->recv_ddgst),
+                       le32_to_cpu(queue->exp_ddgst));
+               return -EIO;
+       }
+
+       nvme_tcp_init_recv_ctx(queue);
+       return 0;
+}
+
+static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
+                            unsigned int offset, size_t len)
+{
+       struct nvme_tcp_queue *queue = desc->arg.data;
+       size_t consumed = len;
+       int result;
+
+       while (len) {
+               switch (nvme_tcp_recv_state(queue)) {
+               case NVME_TCP_RECV_PDU:
+                       result = nvme_tcp_recv_pdu(queue, skb, &offset, &len);
+                       break;
+               case NVME_TCP_RECV_DATA:
+                       result = nvme_tcp_recv_data(queue, skb, &offset, &len);
+                       break;
+               case NVME_TCP_RECV_DDGST:
+                       result = nvme_tcp_recv_ddgst(queue, skb, &offset, &len);
+                       break;
+               default:
+                       result = -EFAULT;
+               }
+               if (result) {
+                       dev_err(queue->ctrl->ctrl.device,
+                               "receive failed:  %d\n", result);
+                       queue->rd_enabled = false;
+                       nvme_tcp_error_recovery(&queue->ctrl->ctrl);
+                       return result;
+               }
+       }
+
+       return consumed;
+}
+
+static void nvme_tcp_data_ready(struct sock *sk)
+{
+       struct nvme_tcp_queue *queue;
+
+       read_lock(&sk->sk_callback_lock);
+       queue = sk->sk_user_data;
+       if (unlikely(!queue || !queue->rd_enabled))
+               goto done;
+
+       queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+done:
+       read_unlock(&sk->sk_callback_lock);
+}
+
+static void nvme_tcp_write_space(struct sock *sk)
+{
+       struct nvme_tcp_queue *queue;
+
+       read_lock_bh(&sk->sk_callback_lock);
+       queue = sk->sk_user_data;
+
+       if (!queue)
+               goto done;
+
+       if (sk_stream_is_writeable(sk)) {
+               clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+               queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+       }
+done:
+       read_unlock_bh(&sk->sk_callback_lock);
+}
+
+static void nvme_tcp_state_change(struct sock *sk)
+{
+       struct nvme_tcp_queue *queue;
+
+       read_lock(&sk->sk_callback_lock);
+       queue = sk->sk_user_data;
+       if (!queue)
+               goto done;
+
+       switch (sk->sk_state) {
+       case TCP_CLOSE:
+       case TCP_CLOSE_WAIT:
+       case TCP_LAST_ACK:
+       case TCP_FIN_WAIT1:
+       case TCP_FIN_WAIT2:
+               /* fallthrough */
+               nvme_tcp_error_recovery(&queue->ctrl->ctrl);
+               break;
+       default:
+               dev_info(queue->ctrl->ctrl.device,
+                       "queue %d socket state %d\n",
+                       nvme_tcp_queue_id(queue), sk->sk_state);
+       }
+
+       queue->sc(sk);
+done:
+       read_unlock(&sk->sk_callback_lock);
+}
+
+static inline void nvme_tcp_done_send_req(struct nvme_tcp_queue *queue)
+{
+       queue->request = NULL;
+}
+
+static void nvme_tcp_fail_request(struct nvme_tcp_request *req)
+{
+       union nvme_result res = {};
+
+       nvme_end_request(blk_mq_rq_from_pdu(req),
+               NVME_SC_DATA_XFER_ERROR, res);
+}
+
+static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
+{
+       struct nvme_tcp_queue *queue = req->queue;
+
+       while (true) {
+               struct page *page = nvme_tcp_req_cur_page(req);
+               size_t offset = nvme_tcp_req_cur_offset(req);
+               size_t len = nvme_tcp_req_cur_length(req);
+               bool last = nvme_tcp_pdu_last_send(req, len);
+               int ret, flags = MSG_DONTWAIT;
+
+               if (last && !queue->data_digest)
+                       flags |= MSG_EOR;
+               else
+                       flags |= MSG_MORE;
+
+               ret = kernel_sendpage(queue->sock, page, offset, len, flags);
+               if (ret <= 0)
+                       return ret;
+
+               nvme_tcp_advance_req(req, ret);
+               if (queue->data_digest)
+                       nvme_tcp_ddgst_update(queue->snd_hash, page, offset, ret);
+
+               /* fully successful last write */
+               if (last && ret == len) {
+                       if (queue->data_digest) {
+                               nvme_tcp_ddgst_final(queue->snd_hash,
+                                       &req->ddgst);
+                               req->snd.state = NVME_TCP_SEND_DDGST;
+                               req->snd.offset = 0;
+                       } else {
+                               nvme_tcp_done_send_req(queue);
+                       }
+                       return 1;
+               }
+       }
+       return -EAGAIN;
+}
+
+static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
+{
+       struct nvme_tcp_queue *queue = req->queue;
+       struct nvme_tcp_send_ctx *snd = &req->snd;
+       struct nvme_tcp_cmd_pdu *pdu = req->pdu;
+       bool inline_data = nvme_tcp_has_inline_data(req);
+       int flags = MSG_DONTWAIT | (inline_data ? MSG_MORE : MSG_EOR);
+       u8 hdgst = nvme_tcp_hdgst_len(queue);
+       int len = sizeof(*pdu) + hdgst - snd->offset;
+       int ret;
+
+       if (queue->hdr_digest && !snd->offset)
+               nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
+
+       ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
+                       offset_in_page(pdu) + snd->offset, len,  flags);
+       if (unlikely(ret <= 0))
+               return ret;
+
+       len -= ret;
+       if (!len) {
+               if (inline_data) {
+                       req->snd.state = NVME_TCP_SEND_DATA;
+                       if (queue->data_digest)
+                               crypto_ahash_init(queue->snd_hash);
+                       nvme_tcp_init_send_iter(req);
+               } else {
+                       nvme_tcp_done_send_req(queue);
+               }
+               return 1;
+       }
+       snd->offset += ret;
+
+       return -EAGAIN;
+}
+
+static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req)
+{
+       struct nvme_tcp_queue *queue = req->queue;
+       struct nvme_tcp_send_ctx *snd = &req->snd;
+       struct nvme_tcp_data_pdu *pdu = req->pdu;
+       u8 hdgst = nvme_tcp_hdgst_len(queue);
+       int len = sizeof(*pdu) - snd->offset + hdgst;
+       int ret;
+
+       if (queue->hdr_digest && !snd->offset)
+               nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
+
+       ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
+                       offset_in_page(pdu) + snd->offset, len,
+                       MSG_DONTWAIT | MSG_MORE);
+       if (unlikely(ret <= 0))
+               return ret;
+
+       len -= ret;
+       if (!len) {
+               req->snd.state = NVME_TCP_SEND_DATA;
+               if (queue->data_digest)
+                       crypto_ahash_init(queue->snd_hash);
+               if (!req->snd.data_sent)
+                       nvme_tcp_init_send_iter(req);
+               return 1;
+       }
+       snd->offset += ret;
+
+       return -EAGAIN;
+}
+
+static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req)
+{
+       struct nvme_tcp_queue *queue = req->queue;
+       int ret;
+       struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_EOR };
+       struct kvec iov = {
+               .iov_base = &req->ddgst + req->snd.offset,
+               .iov_len = NVME_TCP_DIGEST_LENGTH - req->snd.offset
+       };
+
+       ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
+       if (unlikely(ret <= 0))
+               return ret;
+
+       if (req->snd.offset + ret == NVME_TCP_DIGEST_LENGTH) {
+               nvme_tcp_done_send_req(queue);
+               return 1;
+       }
+
+       req->snd.offset += ret;
+       return -EAGAIN;
+}
+
+static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
+{
+       struct nvme_tcp_request *req;
+       int ret = 1;
+
+       if (!queue->request) {
+               queue->request = nvme_tcp_fetch_request(queue);
+               if (!queue->request)
+                       return 0;
+       }
+       req = queue->request;
+
+       if (req->snd.state == NVME_TCP_SEND_CMD_PDU) {
+               ret = nvme_tcp_try_send_cmd_pdu(req);
+               if (ret <= 0)
+                       goto done;
+               if (!nvme_tcp_has_inline_data(req))
+                       return ret;
+       }
+
+       if (req->snd.state == NVME_TCP_SEND_H2C_PDU) {
+               ret = nvme_tcp_try_send_data_pdu(req);
+               if (ret <= 0)
+                       goto done;
+       }
+
+       if (req->snd.state == NVME_TCP_SEND_DATA) {
+               ret = nvme_tcp_try_send_data(req);
+               if (ret <= 0)
+                       goto done;
+       }
+
+       if (req->snd.state == NVME_TCP_SEND_DDGST)
+               ret = nvme_tcp_try_send_ddgst(req);
+done:
+       if (ret == -EAGAIN)
+               ret = 0;
+       return ret;
+}
+
+static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
+{
+       struct sock *sk = queue->sock->sk;
+       read_descriptor_t rd_desc;
+       int consumed;
+
+       rd_desc.arg.data = queue;
+       rd_desc.count = 1;
+       lock_sock(sk);
+       consumed = tcp_read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
+       release_sock(sk);
+       return consumed;
+}
+
+static void nvme_tcp_io_work(struct work_struct *w)
+{
+       struct nvme_tcp_queue *queue =
+               container_of(w, struct nvme_tcp_queue, io_work);
+       unsigned long start = jiffies + msecs_to_jiffies(1);
+
+       do {
+               bool pending = false;
+               int result;
+
+               result = nvme_tcp_try_send(queue);
+               if (result > 0) {
+                       pending = true;
+               } else if (unlikely(result < 0)) {
+                       dev_err(queue->ctrl->ctrl.device,
+                               "failed to send request %d\n", result);
+                       if (result != -EPIPE)
+                               nvme_tcp_fail_request(queue->request);
+                       nvme_tcp_done_send_req(queue);
+                       return;
+               }
+
+               result = nvme_tcp_try_recv(queue);
+               if (result > 0)
+                       pending = true;
+
+               if (!pending)
+                       return;
+
+       } while (time_after(jiffies, start)); /* quota is exhausted */
+
+       queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+}
+
+static void nvme_tcp_free_crypto(struct nvme_tcp_queue *queue)
+{
+       struct crypto_ahash *tfm = crypto_ahash_reqtfm(queue->rcv_hash);
+
+       ahash_request_free(queue->rcv_hash);
+       ahash_request_free(queue->snd_hash);
+       crypto_free_ahash(tfm);
+}
+
+static int nvme_tcp_alloc_crypto(struct nvme_tcp_queue *queue)
+{
+       struct crypto_ahash *tfm;
+
+       tfm = crypto_alloc_ahash("crc32c", 0, CRYPTO_ALG_ASYNC);
+       if (IS_ERR(tfm))
+               return PTR_ERR(tfm);
+
+       queue->snd_hash = ahash_request_alloc(tfm, GFP_KERNEL);
+       if (!queue->snd_hash)
+               goto free_tfm;
+       ahash_request_set_callback(queue->snd_hash, 0, NULL, NULL);
+
+       queue->rcv_hash = ahash_request_alloc(tfm, GFP_KERNEL);
+       if (!queue->rcv_hash)
+               goto free_snd_hash;
+       ahash_request_set_callback(queue->rcv_hash, 0, NULL, NULL);
+
+       return 0;
+free_snd_hash:
+       ahash_request_free(queue->snd_hash);
+free_tfm:
+       crypto_free_ahash(tfm);
+       return -ENOMEM;
+}
+
+static void nvme_tcp_free_async_req(struct nvme_tcp_ctrl *ctrl)
+{
+       struct nvme_tcp_request *async = &ctrl->async_req;
+
+       page_frag_free(async->pdu);
+}
+
+static int nvme_tcp_alloc_async_req(struct nvme_tcp_ctrl *ctrl)
+{
+       struct nvme_tcp_queue *queue = &ctrl->queues[0];
+       struct nvme_tcp_request *async = &ctrl->async_req;
+       u8 hdgst = nvme_tcp_hdgst_len(queue);
+
+       async->pdu = page_frag_alloc(&queue->pf_cache,
+               sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
+               GFP_KERNEL | __GFP_ZERO);
+       if (!async->pdu)
+               return -ENOMEM;
+
+       async->queue = &ctrl->queues[0];
+       return 0;
+}
+
+static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
+{
+       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+       struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+
+       if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
+               return;
+
+       if (queue->hdr_digest || queue->data_digest)
+               nvme_tcp_free_crypto(queue);
+
+       sock_release(queue->sock);
+       kfree(queue->rcv.pdu);
+}
+
+static int nvme_tcp_init_connection(struct nvme_tcp_queue *queue)
+{
+       struct nvme_tcp_icreq_pdu *icreq;
+       struct nvme_tcp_icresp_pdu *icresp;
+       struct msghdr msg = {};
+       struct kvec iov;
+       bool ctrl_hdgst, ctrl_ddgst;
+       int ret;
+
+       icreq = kzalloc(sizeof(*icreq), GFP_KERNEL);
+       if (!icreq)
+               return -ENOMEM;
+
+       icresp = kzalloc(sizeof(*icresp), GFP_KERNEL);
+       if (!icresp) {
+               ret = -ENOMEM;
+               goto free_icreq;
+       }
+
+       icreq->hdr.type = nvme_tcp_icreq;
+       icreq->hdr.hlen = sizeof(*icreq);
+       icreq->hdr.pdo = 0;
+       icreq->hdr.plen = cpu_to_le32(icreq->hdr.hlen);
+       icreq->pfv = cpu_to_le16(NVME_TCP_PFV_1_0);
+       icreq->maxr2t = cpu_to_le16(1); /* single inflight r2t supported */
+       icreq->hpda = 0; /* no alignment constraint */
+       if (queue->hdr_digest)
+               icreq->digest |= NVME_TCP_HDR_DIGEST_ENABLE;
+       if (queue->data_digest)
+               icreq->digest |= NVME_TCP_DATA_DIGEST_ENABLE;
+
+       iov.iov_base = icreq;
+       iov.iov_len = sizeof(*icreq);
+       ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
+       if (ret < 0)
+               goto free_icresp;
+
+       memset(&msg, 0, sizeof(msg));
+       iov.iov_base = icresp;
+       iov.iov_len = sizeof(*icresp);
+       ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
+                       iov.iov_len, msg.msg_flags);
+       if (ret < 0)
+               goto free_icresp;
+
+       ret = -EINVAL;
+       if (icresp->hdr.type != nvme_tcp_icresp) {
+               pr_err("queue %d: bad type returned %d\n",
+                       nvme_tcp_queue_id(queue), icresp->hdr.type);
+               goto free_icresp;
+       }
+
+       if (le32_to_cpu(icresp->hdr.plen) != sizeof(*icresp)) {
+               pr_err("queue %d: bad pdu length returned %d\n",
+                       nvme_tcp_queue_id(queue), icresp->hdr.plen);
+               goto free_icresp;
+       }
+
+       if (icresp->pfv != NVME_TCP_PFV_1_0) {
+               pr_err("queue %d: bad pfv returned %d\n",
+                       nvme_tcp_queue_id(queue), icresp->pfv);
+               goto free_icresp;
+       }
+
+       ctrl_ddgst = !!(icresp->digest & NVME_TCP_DATA_DIGEST_ENABLE);
+       if ((queue->data_digest && !ctrl_ddgst) ||
+           (!queue->data_digest && ctrl_ddgst)) {
+               pr_err("queue %d: data digest mismatch host: %s ctrl: %s\n",
+                       nvme_tcp_queue_id(queue),
+                       queue->data_digest ? "enabled" : "disabled",
+                       ctrl_ddgst ? "enabled" : "disabled");
+               goto free_icresp;
+       }
+
+       ctrl_hdgst = !!(icresp->digest & NVME_TCP_HDR_DIGEST_ENABLE);
+       if ((queue->hdr_digest && !ctrl_hdgst) ||
+           (!queue->hdr_digest && ctrl_hdgst)) {
+               pr_err("queue %d: header digest mismatch host: %s ctrl: %s\n",
+                       nvme_tcp_queue_id(queue),
+                       queue->hdr_digest ? "enabled" : "disabled",
+                       ctrl_hdgst ? "enabled" : "disabled");
+               goto free_icresp;
+       }
+
+       if (icresp->cpda != 0) {
+               pr_err("queue %d: unsupported cpda returned %d\n",
+                       nvme_tcp_queue_id(queue), icresp->cpda);
+               goto free_icresp;
+       }
+
+       ret = 0;
+free_icresp:
+       kfree(icresp);
+free_icreq:
+       kfree(icreq);
+       return ret;
+}
+
+static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
+               int qid, size_t queue_size)
+{
+       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+       struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+       struct linger sol = { .l_onoff = 1, .l_linger = 0 };
+       int ret, opt, rcv_pdu_size;
+
+       queue->ctrl = ctrl;
+       INIT_LIST_HEAD(&queue->send_list);
+       spin_lock_init(&queue->lock);
+       INIT_WORK(&queue->io_work, nvme_tcp_io_work);
+       queue->queue_size = queue_size;
+
+       if (qid > 0)
+               queue->cmnd_capsule_len = ctrl->ctrl.ioccsz * 16;
+       else
+               queue->cmnd_capsule_len = sizeof(struct nvme_command) +
+                                               NVME_TCP_ADMIN_CCSZ;
+
+       ret = sock_create(ctrl->addr.ss_family, SOCK_STREAM,
+                       IPPROTO_TCP, &queue->sock);
+       if (ret) {
+               dev_err(ctrl->ctrl.device,
+                       "failed to create socket: %d\n", ret);
+               return ret;
+       }
+
+       /* Single syn retry */
+       opt = 1;
+       ret = kernel_setsockopt(queue->sock, IPPROTO_TCP, TCP_SYNCNT,
+                       (char *)&opt, sizeof(opt));
+       if (ret) {
+               dev_err(ctrl->ctrl.device,
+                       "failed to set TCP_SYNCNT sock opt %d\n", ret);
+               goto err_sock;
+       }
+
+       /* Set TCP no delay */
+       opt = 1;
+       ret = kernel_setsockopt(queue->sock, IPPROTO_TCP,
+                       TCP_NODELAY, (char *)&opt, sizeof(opt));
+       if (ret) {
+               dev_err(ctrl->ctrl.device,
+                       "failed to set TCP_NODELAY sock opt %d\n", ret);
+               goto err_sock;
+       }
+
+       /*
+        * Cleanup whatever is sitting in the TCP transmit queue on socket
+        * close. This is done to prevent stale data from being sent should
+        * the network connection be restored before TCP times out.
+        */
+       ret = kernel_setsockopt(queue->sock, SOL_SOCKET, SO_LINGER,
+                       (char *)&sol, sizeof(sol));
+       if (ret) {
+               dev_err(ctrl->ctrl.device,
+                       "failed to set SO_LINGER sock opt %d\n", ret);
+               goto err_sock;
+       }
+
+       queue->sock->sk->sk_allocation = GFP_ATOMIC;
+       queue->io_cpu = (qid == 0) ? 0 : qid - 1;
+       queue->request = NULL;
+       queue->rcv.data_remaining = 0;
+       queue->rcv.ddgst_remaining = 0;
+       queue->rcv.pdu_remaining = 0;
+       queue->rcv.pdu_offset = 0;
+       sk_set_memalloc(queue->sock->sk);
+
+       if (ctrl->ctrl.opts->mask & NVMF_OPT_HOST_TRADDR) {
+               ret = kernel_bind(queue->sock, (struct sockaddr *)&ctrl->src_addr,
+                       sizeof(ctrl->src_addr));
+               if (ret) {
+                       dev_err(ctrl->ctrl.device,
+                               "failed to bind queue %d socket %d\n",
+                               qid, ret);
+                       goto err_sock;
+               }
+       }
+
+       queue->hdr_digest = nctrl->opts->hdr_digest;
+       queue->data_digest = nctrl->opts->data_digest;
+       if (queue->hdr_digest || queue->data_digest) {
+               ret = nvme_tcp_alloc_crypto(queue);
+               if (ret) {
+                       dev_err(ctrl->ctrl.device,
+                               "failed to allocate queue %d crypto\n", qid);
+                       goto err_sock;
+               }
+       }
+
+       rcv_pdu_size = sizeof(struct nvme_tcp_rsp_pdu) +
+                       nvme_tcp_hdgst_len(queue);
+       queue->rcv.pdu = kmalloc(rcv_pdu_size, GFP_KERNEL);
+       if (!queue->rcv.pdu) {
+               ret = -ENOMEM;
+               goto err_crypto;
+       }
+
+       dev_dbg(ctrl->ctrl.device, "connecting queue %d\n",
+                       nvme_tcp_queue_id(queue));
+
+       ret = kernel_connect(queue->sock, (struct sockaddr *)&ctrl->addr,
+               sizeof(ctrl->addr), 0);
+       if (ret) {
+               dev_err(ctrl->ctrl.device,
+                       "failed to connect socket: %d\n", ret);
+               goto err_rcv_pdu;
+       }
+
+       ret = nvme_tcp_init_connection(queue);
+       if (ret)
+               goto err_init_connect;
+
+       queue->rd_enabled = true;
+       set_bit(NVME_TCP_Q_ALLOCATED, &queue->flags);
+       nvme_tcp_init_recv_ctx(queue);
+
+       write_lock_bh(&queue->sock->sk->sk_callback_lock);
+       queue->sock->sk->sk_user_data = queue;
+       queue->sc = queue->sock->sk->sk_state_change;
+       queue->dr = queue->sock->sk->sk_data_ready;
+       queue->ws = queue->sock->sk->sk_write_space;
+       queue->sock->sk->sk_data_ready = nvme_tcp_data_ready;
+       queue->sock->sk->sk_state_change = nvme_tcp_state_change;
+       queue->sock->sk->sk_write_space = nvme_tcp_write_space;
+       write_unlock_bh(&queue->sock->sk->sk_callback_lock);
+
+       return 0;
+
+err_init_connect:
+       kernel_sock_shutdown(queue->sock, SHUT_RDWR);
+err_rcv_pdu:
+       kfree(queue->rcv.pdu);
+err_crypto:
+       if (queue->hdr_digest || queue->data_digest)
+               nvme_tcp_free_crypto(queue);
+err_sock:
+       sock_release(queue->sock);
+       queue->sock = NULL;
+       return ret;
+}
+
+static void nvme_tcp_restore_sock_calls(struct nvme_tcp_queue *queue)
+{
+       struct socket *sock = queue->sock;
+
+       write_lock_bh(&sock->sk->sk_callback_lock);
+       sock->sk->sk_user_data  = NULL;
+       sock->sk->sk_data_ready = queue->dr;
+       sock->sk->sk_state_change = queue->sc;
+       sock->sk->sk_write_space  = queue->ws;
+       write_unlock_bh(&sock->sk->sk_callback_lock);
+}
+
+static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
+{
+       kernel_sock_shutdown(queue->sock, SHUT_RDWR);
+       nvme_tcp_restore_sock_calls(queue);
+       cancel_work_sync(&queue->io_work);
+}
+
+static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
+{
+       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+       struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+
+       if (!test_and_clear_bit(NVME_TCP_Q_LIVE, &queue->flags))
+               return;
+
+       __nvme_tcp_stop_queue(queue);
+}
+
+static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
+{
+       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+       int ret;
+
+       if (idx)
+               ret = nvmf_connect_io_queue(nctrl, idx);
+       else
+               ret = nvmf_connect_admin_queue(nctrl);
+
+       if (!ret) {
+               set_bit(NVME_TCP_Q_LIVE, &ctrl->queues[idx].flags);
+       } else {
+               __nvme_tcp_stop_queue(&ctrl->queues[idx]);
+               dev_err(nctrl->device,
+                       "failed to connect queue: %d ret=%d\n", idx, ret);
+       }
+       return ret;
+}
+
+static void nvme_tcp_free_tagset(struct nvme_ctrl *nctrl,
+               struct blk_mq_tag_set *set)
+{
+       blk_mq_free_tag_set(set);
+}
+
+static struct blk_mq_tag_set *nvme_tcp_alloc_tagset(struct nvme_ctrl *nctrl,
+               bool admin)
+{
+       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+       struct blk_mq_tag_set *set;
+       int ret;
+
+       if (admin) {
+               set = &ctrl->admin_tag_set;
+               memset(set, 0, sizeof(*set));
+               set->ops = &nvme_tcp_admin_mq_ops;
+               set->queue_depth = NVME_AQ_MQ_TAG_DEPTH;
+               set->reserved_tags = 2; /* connect + keep-alive */
+               set->numa_node = NUMA_NO_NODE;
+               set->cmd_size = sizeof(struct nvme_tcp_request);
+               set->driver_data = ctrl;
+               set->nr_hw_queues = 1;
+               set->timeout = ADMIN_TIMEOUT;
+       } else {
+               set = &ctrl->tag_set;
+               memset(set, 0, sizeof(*set));
+               set->ops = &nvme_tcp_mq_ops;
+               set->queue_depth = nctrl->sqsize + 1;
+               set->reserved_tags = 1; /* fabric connect */
+               set->numa_node = NUMA_NO_NODE;
+               set->flags = BLK_MQ_F_SHOULD_MERGE;
+               set->cmd_size = sizeof(struct nvme_tcp_request);
+               set->driver_data = ctrl;
+               set->nr_hw_queues = nctrl->queue_count - 1;
+               set->timeout = NVME_IO_TIMEOUT;
+       }
+
+       ret = blk_mq_alloc_tag_set(set);
+       if (ret)
+               return ERR_PTR(ret);
+
+       return set;
+}
+
+static void nvme_tcp_free_admin_queue(struct nvme_ctrl *ctrl)
+{
+       if (to_tcp_ctrl(ctrl)->async_req.pdu) {
+               nvme_tcp_free_async_req(to_tcp_ctrl(ctrl));
+               to_tcp_ctrl(ctrl)->async_req.pdu = NULL;
+       }
+
+       nvme_tcp_free_queue(ctrl, 0);
+}
+
+static void nvme_tcp_free_io_queues(struct nvme_ctrl *ctrl)
+{
+       int i;
+
+       for (i = 1; i < ctrl->queue_count; i++)
+               nvme_tcp_free_queue(ctrl, i);
+}
+
+static void nvme_tcp_stop_admin_queue(struct nvme_ctrl *ctrl)
+{
+       nvme_tcp_stop_queue(ctrl, 0);
+}
+
+static void nvme_tcp_stop_io_queues(struct nvme_ctrl *ctrl)
+{
+       int i;
+
+       for (i = 1; i < ctrl->queue_count; i++)
+               nvme_tcp_stop_queue(ctrl, i);
+}
+
+static int nvme_tcp_start_admin_queue(struct nvme_ctrl *ctrl)
+{
+       return nvme_tcp_start_queue(ctrl, 0);
+}
+
+static int nvme_tcp_start_io_queues(struct nvme_ctrl *ctrl)
+{
+       int i, ret = 0;
+
+       for (i = 1; i < ctrl->queue_count; i++) {
+               ret = nvme_tcp_start_queue(ctrl, i);
+               if (ret)
+                       goto out_stop_queues;
+       }
+
+       return 0;
+
+out_stop_queues:
+       for (i--; i >= 1; i--)
+               nvme_tcp_stop_queue(ctrl, i);
+       return ret;
+}
+
+static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
+{
+       int ret;
+
+       ret = nvme_tcp_alloc_queue(ctrl, 0, NVME_AQ_DEPTH);
+       if (ret)
+               return ret;
+
+       ret = nvme_tcp_alloc_async_req(to_tcp_ctrl(ctrl));
+       if (ret)
+               goto out_free_queue;
+
+       return 0;
+
+out_free_queue:
+       nvme_tcp_free_queue(ctrl, 0);
+       return ret;
+}
+
+static int nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
+{
+       int i, ret;
+
+       for (i = 1; i < ctrl->queue_count; i++) {
+               ret = nvme_tcp_alloc_queue(ctrl, i,
+                               ctrl->sqsize + 1);
+               if (ret)
+                       goto out_free_queues;
+       }
+
+       return 0;
+
+out_free_queues:
+       for (i--; i >= 1; i--)
+               nvme_tcp_free_queue(ctrl, i);
+
+       return ret;
+}
+
+static unsigned int nvme_tcp_nr_io_queues(struct nvme_ctrl *ctrl)
+{
+       return min(ctrl->queue_count - 1, num_online_cpus());
+}
+
+static int nvme_alloc_io_queues(struct nvme_ctrl *ctrl)
+{
+       unsigned int nr_io_queues;
+       int ret;
+
+       nr_io_queues = nvme_tcp_nr_io_queues(ctrl);
+       ret = nvme_set_queue_count(ctrl, &nr_io_queues);
+       if (ret)
+               return ret;
+
+       ctrl->queue_count = nr_io_queues + 1;
+       if (ctrl->queue_count < 2)
+               return 0;
+
+       dev_info(ctrl->device,
+               "creating %d I/O queues.\n", nr_io_queues);
+
+       return nvme_tcp_alloc_io_queues(ctrl);
+}
+
+void nvme_tcp_destroy_io_queues(struct nvme_ctrl *ctrl, bool remove)
+{
+       nvme_tcp_stop_io_queues(ctrl);
+       if (remove) {
+               if (ctrl->ops->flags & NVME_F_FABRICS)
+                       blk_cleanup_queue(ctrl->connect_q);
+               nvme_tcp_free_tagset(ctrl, ctrl->tagset);
+       }
+       nvme_tcp_free_io_queues(ctrl);
+}
+
+int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
+{
+       int ret;
+
+       ret = nvme_alloc_io_queues(ctrl);
+       if (ret)
+               return ret;
+
+       if (new) {
+               ctrl->tagset = nvme_tcp_alloc_tagset(ctrl, false);
+               if (IS_ERR(ctrl->tagset)) {
+                       ret = PTR_ERR(ctrl->tagset);
+                       goto out_free_io_queues;
+               }
+
+               if (ctrl->ops->flags & NVME_F_FABRICS) {
+                       ctrl->connect_q = blk_mq_init_queue(ctrl->tagset);
+                       if (IS_ERR(ctrl->connect_q)) {
+                               ret = PTR_ERR(ctrl->connect_q);
+                               goto out_free_tag_set;
+                       }
+               }
+       } else {
+               blk_mq_update_nr_hw_queues(ctrl->tagset,
+                       ctrl->queue_count - 1);
+       }
+
+       ret = nvme_tcp_start_io_queues(ctrl);
+       if (ret)
+               goto out_cleanup_connect_q;
+
+       return 0;
+
+out_cleanup_connect_q:
+       if (new && (ctrl->ops->flags & NVME_F_FABRICS))
+               blk_cleanup_queue(ctrl->connect_q);
+out_free_tag_set:
+       if (new)
+               nvme_tcp_free_tagset(ctrl, ctrl->tagset);
+out_free_io_queues:
+       nvme_tcp_free_io_queues(ctrl);
+       return ret;
+}
+
+void nvme_tcp_destroy_admin_queue(struct nvme_ctrl *ctrl, bool remove)
+{
+       nvme_tcp_stop_admin_queue(ctrl);
+       if (remove) {
+               free_opal_dev(ctrl->opal_dev);
+               blk_cleanup_queue(ctrl->admin_q);
+               nvme_tcp_free_tagset(ctrl, ctrl->admin_tagset);
+       }
+       nvme_tcp_free_admin_queue(ctrl);
+}
+
+int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new)
+{
+       int error;
+
+       error = nvme_tcp_alloc_admin_queue(ctrl);
+       if (error)
+               return error;
+
+       if (new) {
+               ctrl->admin_tagset = nvme_tcp_alloc_tagset(ctrl, true);
+               if (IS_ERR(ctrl->admin_tagset)) {
+                       error = PTR_ERR(ctrl->admin_tagset);
+                       goto out_free_queue;
+               }
+
+               ctrl->admin_q = blk_mq_init_queue(ctrl->admin_tagset);
+               if (IS_ERR(ctrl->admin_q)) {
+                       error = PTR_ERR(ctrl->admin_q);
+                       goto out_free_tagset;
+               }
+       }
+
+       error = nvme_tcp_start_admin_queue(ctrl);
+       if (error)
+               goto out_cleanup_queue;
+
+       error = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &ctrl->cap);
+       if (error) {
+               dev_err(ctrl->device,
+                       "prop_get NVME_REG_CAP failed\n");
+               goto out_stop_queue;
+       }
+
+       ctrl->sqsize = min_t(int, NVME_CAP_MQES(ctrl->cap), ctrl->sqsize);
+
+       error = nvme_enable_ctrl(ctrl, ctrl->cap);
+       if (error)
+               goto out_stop_queue;
+
+       error = nvme_init_identify(ctrl);
+       if (error)
+               goto out_stop_queue;
+
+       return 0;
+
+out_stop_queue:
+       nvme_tcp_stop_admin_queue(ctrl);
+out_cleanup_queue:
+       if (new)
+               blk_cleanup_queue(ctrl->admin_q);
+out_free_tagset:
+       if (new)
+               nvme_tcp_free_tagset(ctrl, ctrl->admin_tagset);
+out_free_queue:
+       nvme_tcp_free_admin_queue(ctrl);
+       return error;
+}
+
+static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
+               bool remove)
+{
+       blk_mq_quiesce_queue(ctrl->admin_q);
+       nvme_tcp_stop_admin_queue(ctrl);
+       blk_mq_tagset_busy_iter(ctrl->admin_tagset, nvme_cancel_request, ctrl);
+       blk_mq_unquiesce_queue(ctrl->admin_q);
+       nvme_tcp_destroy_admin_queue(ctrl, remove);
+}
+
+static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
+               bool remove)
+{
+       if (ctrl->queue_count > 1) {
+               nvme_stop_queues(ctrl);
+               nvme_tcp_stop_io_queues(ctrl);
+               blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl);
+               if (remove)
+                       nvme_start_queues(ctrl);
+               nvme_tcp_destroy_io_queues(ctrl, remove);
+       }
+}
+
+static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl)
+{
+       /* If we are resetting/deleting then do nothing */
+       if (ctrl->state != NVME_CTRL_CONNECTING) {
+               WARN_ON_ONCE(ctrl->state == NVME_CTRL_NEW ||
+                       ctrl->state == NVME_CTRL_LIVE);
+               return;
+       }
+
+       if (nvmf_should_reconnect(ctrl)) {
+               dev_info(ctrl->device, "Reconnecting in %d seconds...\n",
+                       ctrl->opts->reconnect_delay);
+               queue_delayed_work(nvme_wq, &ctrl->connect_work,
+                               ctrl->opts->reconnect_delay * HZ);
+       } else {
+               dev_info(ctrl->device, "Removing controller...\n");
+               nvme_delete_ctrl(ctrl);
+       }
+}
+
+static int nvme_tcp_setup_ctrl(struct nvme_ctrl *ctrl, bool new)
+{
+       struct nvmf_ctrl_options *opts = ctrl->opts;
+       int ret = -EINVAL;
+
+       ret = nvme_tcp_configure_admin_queue(ctrl, new);
+       if (ret)
+               return ret;
+
+       if (ctrl->icdoff) {
+               dev_err(ctrl->device, "icdoff is not supported!\n");
+               goto destroy_admin;
+       }
+
+       if (opts->queue_size > ctrl->sqsize + 1)
+               dev_warn(ctrl->device,
+                       "queue_size %zu > ctrl sqsize %u, clamping down\n",
+                       opts->queue_size, ctrl->sqsize + 1);
+
+       if (ctrl->sqsize + 1 > ctrl->maxcmd) {
+               dev_warn(ctrl->device,
+                       "sqsize %u > ctrl maxcmd %u, clamping down\n",
+                       ctrl->sqsize + 1, ctrl->maxcmd);
+               ctrl->sqsize = ctrl->maxcmd - 1;
+       }
+
+       if (ctrl->queue_count > 1) {
+               ret = nvme_tcp_configure_io_queues(ctrl, new);
+               if (ret)
+                       goto destroy_admin;
+       }
+
+       if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE)) {
+               /* state change failure is ok if we're in DELETING state */
+               WARN_ON_ONCE(ctrl->state != NVME_CTRL_DELETING);
+               ret = -EINVAL;
+               goto destroy_io;
+       }
+
+       nvme_start_ctrl(ctrl);
+       return 0;
+
+destroy_io:
+       if (ctrl->queue_count > 1)
+               nvme_tcp_destroy_io_queues(ctrl, new);
+destroy_admin:
+       nvme_tcp_stop_admin_queue(ctrl);
+       nvme_tcp_destroy_admin_queue(ctrl, new);
+       return ret;
+}
+
+static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
+{
+       struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
+                       struct nvme_ctrl, connect_work);
+
+       ++ctrl->nr_reconnects;
+
+       if (nvme_tcp_setup_ctrl(ctrl, false))
+               goto requeue;
+
+       dev_info(ctrl->device, "Successfully reconnected (%d attempts)\n",
+                       ctrl->nr_reconnects);
+
+       ctrl->nr_reconnects = 0;
+
+       return;
+
+requeue:
+       dev_info(ctrl->device, "Failed reconnect attempt %d\n",
+                       ctrl->nr_reconnects);
+       nvme_tcp_reconnect_or_remove(ctrl);
+}
+
+static void nvme_tcp_error_recovery_work(struct work_struct *work)
+{
+       struct nvme_ctrl *ctrl = container_of(work,
+                       struct nvme_ctrl, err_work);
+
+       nvme_stop_keep_alive(ctrl);
+       nvme_tcp_teardown_io_queues(ctrl, false);
+       /* unquiesce to fail fast pending requests */
+       nvme_start_queues(ctrl);
+       nvme_tcp_teardown_admin_queue(ctrl, false);
+
+       if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_CONNECTING)) {
+               /* state change failure is ok if we're in DELETING state */
+               WARN_ON_ONCE(ctrl->state != NVME_CTRL_DELETING);
+               return;
+       }
+
+       nvme_tcp_reconnect_or_remove(ctrl);
+}
+
+static void nvme_tcp_teardown_ctrl(struct nvme_ctrl *ctrl, bool shutdown)
+{
+       nvme_tcp_teardown_io_queues(ctrl, shutdown);
+       if (shutdown)
+               nvme_shutdown_ctrl(ctrl);
+       else
+               nvme_disable_ctrl(ctrl, ctrl->cap);
+       nvme_tcp_teardown_admin_queue(ctrl, shutdown);
+}
+
+static void nvme_tcp_delete_ctrl(struct nvme_ctrl *ctrl)
+{
+       nvme_tcp_teardown_ctrl(ctrl, true);
+}
+
+static void nvme_reset_ctrl_work(struct work_struct *work)
+{
+       struct nvme_ctrl *ctrl =
+               container_of(work, struct nvme_ctrl, reset_work);
+
+       nvme_stop_ctrl(ctrl);
+       nvme_tcp_teardown_ctrl(ctrl, false);
+
+       if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_CONNECTING)) {
+               /* state change failure is ok if we're in DELETING state */
+               WARN_ON_ONCE(ctrl->state != NVME_CTRL_DELETING);
+               return;
+       }
+
+       if (nvme_tcp_setup_ctrl(ctrl, false))
+               goto out_fail;
+
+       return;
+
+out_fail:
+       ++ctrl->nr_reconnects;
+       nvme_tcp_reconnect_or_remove(ctrl);
+}
+
+static void nvme_tcp_stop_ctrl(struct nvme_ctrl *ctrl)
+{
+       cancel_work_sync(&ctrl->err_work);
+       cancel_delayed_work_sync(&ctrl->connect_work);
+}
+
+static void nvme_tcp_free_ctrl(struct nvme_ctrl *nctrl)
+{
+       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+
+       if (list_empty(&ctrl->list))
+               goto free_ctrl;
+
+       mutex_lock(&nvme_tcp_ctrl_mutex);
+       list_del(&ctrl->list);
+       mutex_unlock(&nvme_tcp_ctrl_mutex);
+
+       nvmf_free_options(nctrl->opts);
+free_ctrl:
+       kfree(ctrl->queues);
+       kfree(ctrl);
+}
+
+static void nvme_tcp_set_sg_null(struct nvme_command *c)
+{
+       struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
+
+       sg->addr = 0;
+       sg->length = 0;
+       sg->type = (NVME_TRANSPORT_SGL_DATA_DESC << 4) |
+                       NVME_SGL_FMT_TRANSPORT_A;
+}
+
+static void nvme_tcp_set_sg_inline(struct nvme_tcp_queue *queue,
+               struct nvme_tcp_request *req, struct nvme_command *c)
+{
+       struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
+
+       sg->addr = cpu_to_le64(queue->ctrl->ctrl.icdoff);
+       sg->length = cpu_to_le32(req->data_len);
+       sg->type = (NVME_SGL_FMT_DATA_DESC << 4) | NVME_SGL_FMT_OFFSET;
+}
+
+static void nvme_tcp_set_sg_host_data(struct nvme_tcp_request *req,
+               struct nvme_command *c)
+{
+       struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
+
+       sg->addr = 0;
+       sg->length = cpu_to_le32(req->data_len);
+       sg->type = (NVME_TRANSPORT_SGL_DATA_DESC << 4) |
+                       NVME_SGL_FMT_TRANSPORT_A;
+}
+
+static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
+{
+       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(arg);
+       struct nvme_tcp_queue *queue = &ctrl->queues[0];
+       struct nvme_tcp_cmd_pdu *pdu = ctrl->async_req.pdu;
+       struct nvme_command *cmd = &pdu->cmd;
+       u8 hdgst = nvme_tcp_hdgst_len(queue);
+
+       memset(pdu, 0, sizeof(*pdu));
+       pdu->hdr.type = nvme_tcp_cmd;
+       if (queue->hdr_digest)
+               pdu->hdr.flags |= NVME_TCP_F_HDGST;
+       pdu->hdr.hlen = sizeof(*pdu);
+       pdu->hdr.plen = cpu_to_le32(pdu->hdr.hlen + hdgst);
+
+       cmd->common.opcode = nvme_admin_async_event;
+       cmd->common.command_id = NVME_AQ_BLK_MQ_DEPTH;
+       cmd->common.flags |= NVME_CMD_SGL_METABUF;
+       nvme_tcp_set_sg_null(cmd);
+
+       ctrl->async_req.snd.state = NVME_TCP_SEND_CMD_PDU;
+       ctrl->async_req.snd.offset = 0;
+       ctrl->async_req.snd.curr_bio = NULL;
+       ctrl->async_req.rcv.curr_bio = NULL;
+       ctrl->async_req.data_len = 0;
+
+       nvme_tcp_queue_request(&ctrl->async_req);
+}
+
+static enum blk_eh_timer_return
+nvme_tcp_timeout(struct request *rq, bool reserved)
+{
+       struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+       struct nvme_tcp_ctrl *ctrl = req->queue->ctrl;
+       struct nvme_tcp_cmd_pdu *pdu = req->pdu;
+
+       dev_dbg(ctrl->ctrl.device,
+               "queue %d: timeout request %#x type %d\n",
+               nvme_tcp_queue_id(req->queue), rq->tag,
+               pdu->hdr.type);
+
+       if (ctrl->ctrl.state != NVME_CTRL_LIVE) {
+               union nvme_result res = {};
+
+               nvme_req(rq)->flags |= NVME_REQ_CANCELLED;
+               nvme_end_request(rq, NVME_SC_ABORT_REQ, res);
+               return BLK_EH_DONE;
+       }
+
+       /* queue error recovery */
+       nvme_tcp_error_recovery(&ctrl->ctrl);
+
+       return BLK_EH_RESET_TIMER;
+}
+
+static blk_status_t nvme_tcp_map_data(struct nvme_tcp_queue *queue,
+                       struct request *rq)
+{
+       struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+       struct nvme_tcp_cmd_pdu *pdu = req->pdu;
+       struct nvme_command *c = &pdu->cmd;
+
+       c->common.flags |= NVME_CMD_SGL_METABUF;
+
+       if (!req->data_len) {
+               nvme_tcp_set_sg_null(c);
+               return 0;
+       }
+
+       if (rq_data_dir(rq) == WRITE &&
+           req->data_len <= nvme_tcp_inline_data_size(queue))
+               nvme_tcp_set_sg_inline(queue, req, c);
+       else
+               nvme_tcp_set_sg_host_data(req, c);
+
+       return 0;
+}
+
+static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
+               struct request *rq)
+{
+       struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+       struct nvme_tcp_cmd_pdu *pdu = req->pdu;
+       struct nvme_tcp_queue *queue = req->queue;
+       u8 hdgst = nvme_tcp_hdgst_len(queue), ddgst = 0;
+       blk_status_t ret;
+
+       ret = nvme_setup_cmd(ns, rq, &pdu->cmd);
+       if (ret)
+               return ret;
+
+       req->snd.state = NVME_TCP_SEND_CMD_PDU;
+       req->snd.offset = 0;
+       req->snd.data_sent = 0;
+       req->pdu_len = 0;
+       req->pdu_sent = 0;
+       req->data_len = blk_rq_payload_bytes(rq);
+
+       if (rq_data_dir(rq) == WRITE) {
+               req->snd.curr_bio = rq->bio;
+               if (req->data_len <= nvme_tcp_inline_data_size(queue))
+                       req->pdu_len = req->data_len;
+       } else {
+               req->rcv.curr_bio = rq->bio;
+               if (req->rcv.curr_bio)
+                       nvme_tcp_init_recv_iter(req);
+       }
+
+       pdu->hdr.type = nvme_tcp_cmd;
+       pdu->hdr.flags = 0;
+       if (queue->hdr_digest)
+               pdu->hdr.flags |= NVME_TCP_F_HDGST;
+       if (queue->data_digest && req->pdu_len) {
+               pdu->hdr.flags |= NVME_TCP_F_DDGST;
+               ddgst = nvme_tcp_ddgst_len(queue);
+       }
+       pdu->hdr.hlen = sizeof(*pdu);
+       pdu->hdr.pdo = req->pdu_len ? pdu->hdr.hlen + hdgst : 0;
+       pdu->hdr.plen =
+               cpu_to_le32(pdu->hdr.hlen + hdgst + req->pdu_len + ddgst);
+
+       ret = nvme_tcp_map_data(queue, rq);
+       if (unlikely(ret)) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "Failed to map data (%d)\n", ret);
+               return ret;
+       }
+
+       return 0;
+}
+
+static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
+               const struct blk_mq_queue_data *bd)
+{
+       struct nvme_ns *ns = hctx->queue->queuedata;
+       struct nvme_tcp_queue *queue = hctx->driver_data;
+       struct request *rq = bd->rq;
+       struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+       bool queue_ready = test_bit(NVME_TCP_Q_LIVE, &queue->flags);
+       blk_status_t ret;
+
+       if (!nvmf_check_ready(&queue->ctrl->ctrl, rq, queue_ready))
+               return nvmf_fail_nonready_command(&queue->ctrl->ctrl, rq);
+
+       ret = nvme_tcp_setup_cmd_pdu(ns, rq);
+       if (unlikely(ret))
+               return ret;
+
+       blk_mq_start_request(rq);
+
+       nvme_tcp_queue_request(req);
+
+       return BLK_STS_OK;
+}
+
+static struct blk_mq_ops nvme_tcp_mq_ops = {
+       .queue_rq       = nvme_tcp_queue_rq,
+       .complete       = nvme_complete_rq,
+       .init_request   = nvme_tcp_init_request,
+       .exit_request   = nvme_tcp_exit_request,
+       .init_hctx      = nvme_tcp_init_hctx,
+       .timeout        = nvme_tcp_timeout,
+};
+
+static struct blk_mq_ops nvme_tcp_admin_mq_ops = {
+       .queue_rq       = nvme_tcp_queue_rq,
+       .complete       = nvme_complete_rq,
+       .init_request   = nvme_tcp_init_request,
+       .exit_request   = nvme_tcp_exit_request,
+       .init_hctx      = nvme_tcp_init_admin_hctx,
+       .timeout        = nvme_tcp_timeout,
+};
+
+static const struct nvme_ctrl_ops nvme_tcp_ctrl_ops = {
+       .name                   = "tcp",
+       .module                 = THIS_MODULE,
+       .flags                  = NVME_F_FABRICS,
+       .reg_read32             = nvmf_reg_read32,
+       .reg_read64             = nvmf_reg_read64,
+       .reg_write32            = nvmf_reg_write32,
+       .free_ctrl              = nvme_tcp_free_ctrl,
+       .submit_async_event     = nvme_tcp_submit_async_event,
+       .delete_ctrl            = nvme_tcp_delete_ctrl,
+       .get_address            = nvmf_get_address,
+       .stop_ctrl              = nvme_tcp_stop_ctrl,
+};
+
+static bool
+nvme_tcp_existing_controller(struct nvmf_ctrl_options *opts)
+{
+       struct nvme_tcp_ctrl *ctrl;
+       bool found = false;
+
+       mutex_lock(&nvme_tcp_ctrl_mutex);
+       list_for_each_entry(ctrl, &nvme_tcp_ctrl_list, list) {
+               found = nvmf_ip_options_match(&ctrl->ctrl, opts);
+               if (found)
+                       break;
+       }
+       mutex_unlock(&nvme_tcp_ctrl_mutex);
+
+       return found;
+}
+
+static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
+               struct nvmf_ctrl_options *opts)
+{
+       struct nvme_tcp_ctrl *ctrl;
+       int ret;
+
+       ctrl = kzalloc(sizeof(*ctrl), GFP_KERNEL);
+       if (!ctrl)
+               return ERR_PTR(-ENOMEM);
+
+       INIT_LIST_HEAD(&ctrl->list);
+       ctrl->ctrl.opts = opts;
+       ctrl->ctrl.queue_count = opts->nr_io_queues + 1; /* +1 for admin queue */
+       ctrl->ctrl.sqsize = opts->queue_size - 1;
+       ctrl->ctrl.kato = opts->kato;
+
+       INIT_DELAYED_WORK(&ctrl->ctrl.connect_work,
+                       nvme_tcp_reconnect_ctrl_work);
+       INIT_WORK(&ctrl->ctrl.err_work, nvme_tcp_error_recovery_work);
+       INIT_WORK(&ctrl->ctrl.reset_work, nvme_reset_ctrl_work);
+
+       if (!(opts->mask & NVMF_OPT_TRSVCID)) {
+               opts->trsvcid =
+                       kstrdup(__stringify(NVME_TCP_DISC_PORT), GFP_KERNEL);
+               if (!opts->trsvcid) {
+                       ret = -ENOMEM;
+                       goto out_free_ctrl;
+               }
+               opts->mask |= NVMF_OPT_TRSVCID;
+       }
+
+       ret = inet_pton_with_scope(&init_net, AF_UNSPEC,
+                       opts->traddr, opts->trsvcid, &ctrl->addr);
+       if (ret) {
+               pr_err("malformed address passed: %s:%s\n",
+                       opts->traddr, opts->trsvcid);
+               goto out_free_ctrl;
+       }
+
+       if (opts->mask & NVMF_OPT_HOST_TRADDR) {
+               ret = inet_pton_with_scope(&init_net, AF_UNSPEC,
+                       opts->host_traddr, NULL, &ctrl->src_addr);
+               if (ret) {
+                       pr_err("malformed src address passed: %s\n",
+                              opts->host_traddr);
+                       goto out_free_ctrl;
+               }
+       }
+
+       if (!opts->duplicate_connect && nvme_tcp_existing_controller(opts)) {
+               ret = -EALREADY;
+               goto out_free_ctrl;
+       }
+
+       ctrl->queues = kcalloc(opts->nr_io_queues + 1, sizeof(*ctrl->queues),
+                               GFP_KERNEL);
+       if (!ctrl->queues) {
+               ret = -ENOMEM;
+               goto out_free_ctrl;
+       }
+
+       ret = nvme_init_ctrl(&ctrl->ctrl, dev, &nvme_tcp_ctrl_ops, 0);
+       if (ret)
+               goto out_kfree_queues;
+
+       if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
+               WARN_ON_ONCE(1);
+               ret = -EINTR;
+               goto out_uninit_ctrl;
+       }
+
+       ret = nvme_tcp_setup_ctrl(&ctrl->ctrl, true);
+       if (ret)
+               goto out_uninit_ctrl;
+
+       dev_info(ctrl->ctrl.device, "new ctrl: NQN \"%s\", addr %pISp\n",
+               ctrl->ctrl.opts->subsysnqn, &ctrl->addr);
+
+       nvme_get_ctrl(&ctrl->ctrl);
+
+       mutex_lock(&nvme_tcp_ctrl_mutex);
+       list_add_tail(&ctrl->list, &nvme_tcp_ctrl_list);
+       mutex_unlock(&nvme_tcp_ctrl_mutex);
+
+       return &ctrl->ctrl;
+
+out_uninit_ctrl:
+       nvme_uninit_ctrl(&ctrl->ctrl);
+       nvme_put_ctrl(&ctrl->ctrl);
+       if (ret > 0)
+               ret = -EIO;
+       return ERR_PTR(ret);
+out_kfree_queues:
+       kfree(ctrl->queues);
+out_free_ctrl:
+       kfree(ctrl);
+       return ERR_PTR(ret);
+}
+
+static struct nvmf_transport_ops nvme_tcp_transport = {
+       .name           = "tcp",
+       .module         = THIS_MODULE,
+       .required_opts  = NVMF_OPT_TRADDR,
+       .allowed_opts   = NVMF_OPT_TRSVCID | NVMF_OPT_RECONNECT_DELAY |
+                         NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO |
+                         NVMF_OPT_HDR_DIGEST | NVMF_OPT_DATA_DIGEST,
+       .create_ctrl    = nvme_tcp_create_ctrl,
+};
+
+static int __init nvme_tcp_init_module(void)
+{
+       nvme_tcp_wq = alloc_workqueue("nvme_tcp_wq",
+                       WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
+       if (!nvme_tcp_wq)
+               return -ENOMEM;
+
+       nvmf_register_transport(&nvme_tcp_transport);
+       return 0;
+}
+
+static void __exit nvme_tcp_cleanup_module(void)
+{
+       struct nvme_tcp_ctrl *ctrl;
+
+       nvmf_unregister_transport(&nvme_tcp_transport);
+
+       mutex_lock(&nvme_tcp_ctrl_mutex);
+       list_for_each_entry(ctrl, &nvme_tcp_ctrl_list, list)
+               nvme_delete_ctrl(&ctrl->ctrl);
+       mutex_unlock(&nvme_tcp_ctrl_mutex);
+       flush_workqueue(nvme_delete_wq);
+
+       destroy_workqueue(nvme_tcp_wq);
+}
+
+module_init(nvme_tcp_init_module);
+module_exit(nvme_tcp_cleanup_module);
+
+MODULE_LICENSE("GPL v2");
--
2.17.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* RE: [PATCH v2 14/14] nvme-tcp: add NVMe over TCP host driver
@ 2018-11-20 23:34     ` Narayan Ayalasomayajula
  0 siblings, 0 replies; 76+ messages in thread
From: Narayan Ayalasomayajula @ 2018-11-20 23:34 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme
  Cc: linux-block, netdev, Keith Busch, David S. Miller, Christoph Hellwig

Hi Sagi,

>+       icreq->pfv = cpu_to_le16(NVME_TCP_PFV_1_0);
>+       icreq->maxr2t = cpu_to_le16(1); /* single inflight r2t supported */
>+       icreq->hpda = 0; /* no alignment constraint */

The NVMe-TCP spec indicates that MAXR2T is a 0's-based value. To support a single inflight R2T as indicated in the comment above, icreq->maxr2t should be set to 0, right? 
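
If so, a minimal sketch of the adjustment (assuming a single inflight R2T is still the intent) would be something like:

	icreq->maxr2t = cpu_to_le16(0); /* 0's based: one inflight r2t */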

Thanks,
Narayan

-----Original Message-----
From: Linux-nvme <linux-nvme-bounces@lists.infradead.org> On Behalf Of Sagi Grimberg
Sent: Monday, November 19, 2018 7:00 PM
To: linux-nvme@lists.infradead.org
Cc: linux-block@vger.kernel.org; netdev@vger.kernel.org; Keith Busch <keith.busch@intel.com>; David S. Miller <davem@davemloft.net>; Christoph Hellwig <hch@lst.de>
Subject: [PATCH v2 14/14] nvme-tcp: add NVMe over TCP host driver

From: Sagi Grimberg <sagi@lightbitslabs.com>

This patch implements the NVMe over TCP host driver. It can be used to
connect to remote NVMe over Fabrics subsystems over good old TCP/IP.

The driver implements TP 8000, which defines how NVMe over Fabrics capsules
and data are encapsulated in NVMe/TCP PDUs and exchanged on top of a TCP
byte stream. NVMe/TCP header and data digests are supported as well.

To connect to all NVMe over Fabrics controllers reachable on a given target
port over TCP, use the following command:

        nvme connect-all -t tcp -a $IPADDR

This requires the latest version of nvme-cli with TCP support.
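
To connect to a single subsystem instead, something along these lines should
work as well (the subsystem NQN and port below are placeholders):

        nvme connect -t tcp -n <subsysnqn> -a $IPADDR -s <port>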

Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
Signed-off-by: Solganik Alexander <sashas@lightbitslabs.com>
---
 drivers/nvme/host/Kconfig  |   15 +
 drivers/nvme/host/Makefile |    3 +
 drivers/nvme/host/tcp.c    | 2306 ++++++++++++++++++++++++++++++++++++
 3 files changed, 2324 insertions(+)
 create mode 100644 drivers/nvme/host/tcp.c

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index 88a8b5916624..0f345e207675 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -57,3 +57,18 @@ config NVME_FC
          from https://github.com/linux-nvme/nvme-cli.

          If unsure, say N.
+
+config NVME_TCP
+       tristate "NVM Express over Fabrics TCP host driver"
+       depends on INET
+       depends on BLK_DEV_NVME
+       select NVME_FABRICS
+       help
+         This provides support for the NVMe over Fabrics protocol using
+         the TCP transport.  This allows you to use remote block devices
+         exported using the NVMe protocol set.
+
+         To configure a NVMe over Fabrics controller use the nvme-cli tool
+         from https://github.com/linux-nvme/nvme-cli.
+
+         If unsure, say N.
diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
index aea459c65ae1..8a4b671c5f0c 100644
--- a/drivers/nvme/host/Makefile
+++ b/drivers/nvme/host/Makefile
@@ -7,6 +7,7 @@ obj-$(CONFIG_BLK_DEV_NVME)              += nvme.o
 obj-$(CONFIG_NVME_FABRICS)             += nvme-fabrics.o
 obj-$(CONFIG_NVME_RDMA)                        += nvme-rdma.o
 obj-$(CONFIG_NVME_FC)                  += nvme-fc.o
+obj-$(CONFIG_NVME_TCP)                 += nvme-tcp.o

 nvme-core-y                            := core.o
 nvme-core-$(CONFIG_TRACING)            += trace.o
@@ -21,3 +22,5 @@ nvme-fabrics-y                                += fabrics.o
 nvme-rdma-y                            += rdma.o

 nvme-fc-y                              += fc.o
+
+nvme-tcp-y                             += tcp.o
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
new file mode 100644
index 000000000000..4c583859f0ad
--- /dev/null
+++ b/drivers/nvme/host/tcp.c
@@ -0,0 +1,2306 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * NVMe over Fabrics TCP host.
+ * Copyright (c) 2018 LightBits Labs. All rights reserved.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/nvme-tcp.h>
+#include <net/sock.h>
+#include <net/tcp.h>
+#include <linux/blk-mq.h>
+#include <crypto/hash.h>
+
+#include "nvme.h"
+#include "fabrics.h"
+
+struct nvme_tcp_queue;
+
+enum nvme_tcp_send_state {
+       NVME_TCP_SEND_CMD_PDU = 0,
+       NVME_TCP_SEND_H2C_PDU,
+       NVME_TCP_SEND_DATA,
+       NVME_TCP_SEND_DDGST,
+};
+
+struct nvme_tcp_send_ctx {
+       struct bio              *curr_bio;
+       struct iov_iter         iter;
+       size_t                  offset;
+       size_t                  data_sent;
+       enum nvme_tcp_send_state state;
+};
+
+struct nvme_tcp_recv_ctx {
+       struct iov_iter         iter;
+       struct bio              *curr_bio;
+};
+
+struct nvme_tcp_request {
+       struct nvme_request     req;
+       void                    *pdu;
+       struct nvme_tcp_queue   *queue;
+       u32                     data_len;
+       u32                     pdu_len;
+       u32                     pdu_sent;
+       u16                     ttag;
+       struct list_head        entry;
+       struct nvme_tcp_recv_ctx rcv;
+       struct nvme_tcp_send_ctx snd;
+       u32                     ddgst;
+};
+
+enum nvme_tcp_queue_flags {
+       NVME_TCP_Q_ALLOCATED    = 0,
+       NVME_TCP_Q_LIVE         = 1,
+};
+
+enum nvme_tcp_recv_state {
+       NVME_TCP_RECV_PDU = 0,
+       NVME_TCP_RECV_DATA,
+       NVME_TCP_RECV_DDGST,
+};
+
+struct nvme_tcp_queue_recv_ctx {
+       char            *pdu;
+       int             pdu_remaining;
+       int             pdu_offset;
+       size_t          data_remaining;
+       size_t          ddgst_remaining;
+};
+
+struct nvme_tcp_ctrl;
+struct nvme_tcp_queue {
+       struct socket           *sock;
+       struct work_struct      io_work;
+       int                     io_cpu;
+
+       spinlock_t              lock;
+       struct list_head        send_list;
+
+       int                     queue_size;
+       size_t                  cmnd_capsule_len;
+       struct nvme_tcp_ctrl    *ctrl;
+       unsigned long           flags;
+       bool                    rd_enabled;
+
+       struct nvme_tcp_queue_recv_ctx rcv;
+       struct nvme_tcp_request *request;
+
+       bool                    hdr_digest;
+       bool                    data_digest;
+       struct ahash_request    *rcv_hash;
+       struct ahash_request    *snd_hash;
+       __le32                  exp_ddgst;
+       __le32                  recv_ddgst;
+
+       struct page_frag_cache  pf_cache;
+
+       void (*sc)(struct sock *);
+       void (*dr)(struct sock *);
+       void (*ws)(struct sock *);
+};
+
+struct nvme_tcp_ctrl {
+       /* read only in the hot path */
+       struct nvme_tcp_queue   *queues;
+       struct blk_mq_tag_set   tag_set;
+
+       /* other member variables */
+       struct list_head        list;
+       struct blk_mq_tag_set   admin_tag_set;
+       struct sockaddr_storage addr;
+       struct sockaddr_storage src_addr;
+       struct nvme_ctrl        ctrl;
+
+       struct nvme_tcp_request async_req;
+};
+
+static LIST_HEAD(nvme_tcp_ctrl_list);
+static DEFINE_MUTEX(nvme_tcp_ctrl_mutex);
+static struct workqueue_struct *nvme_tcp_wq;
+static struct blk_mq_ops nvme_tcp_mq_ops;
+static struct blk_mq_ops nvme_tcp_admin_mq_ops;
+
+static inline struct nvme_tcp_ctrl *to_tcp_ctrl(struct nvme_ctrl *ctrl)
+{
+       return container_of(ctrl, struct nvme_tcp_ctrl, ctrl);
+}
+
+static inline int nvme_tcp_queue_id(struct nvme_tcp_queue *queue)
+{
+       return queue - queue->ctrl->queues;
+}
+
+static inline struct blk_mq_tags *nvme_tcp_tagset(struct nvme_tcp_queue *queue)
+{
+       u32 queue_idx = nvme_tcp_queue_id(queue);
+
+       if (queue_idx == 0)
+               return queue->ctrl->admin_tag_set.tags[queue_idx];
+       return queue->ctrl->tag_set.tags[queue_idx - 1];
+}
+
+static inline u8 nvme_tcp_hdgst_len(struct nvme_tcp_queue *queue)
+{
+       return queue->hdr_digest ? NVME_TCP_DIGEST_LENGTH : 0;
+}
+
+static inline u8 nvme_tcp_ddgst_len(struct nvme_tcp_queue *queue)
+{
+       return queue->data_digest ? NVME_TCP_DIGEST_LENGTH : 0;
+}
+
+static inline size_t nvme_tcp_inline_data_size(struct nvme_tcp_queue *queue)
+{
+       return queue->cmnd_capsule_len - sizeof(struct nvme_command);
+}
+
+static inline bool nvme_tcp_async_req(struct nvme_tcp_request *req)
+{
+       return req == &req->queue->ctrl->async_req;
+}
+
+static inline bool nvme_tcp_has_inline_data(struct nvme_tcp_request *req)
+{
+       struct request *rq;
+       unsigned int bytes;
+
+       if (unlikely(nvme_tcp_async_req(req)))
+               return false; /* async events don't have a request */
+
+       rq = blk_mq_rq_from_pdu(req);
+       bytes = blk_rq_payload_bytes(rq);
+
+       return rq_data_dir(rq) == WRITE && bytes &&
+               bytes <= nvme_tcp_inline_data_size(req->queue);
+}
+
+static inline struct page *nvme_tcp_req_cur_page(struct nvme_tcp_request *req)
+{
+       return req->snd.iter.bvec->bv_page;
+}
+
+static inline size_t nvme_tcp_req_cur_offset(struct nvme_tcp_request *req)
+{
+       return req->snd.iter.bvec->bv_offset + req->snd.iter.iov_offset;
+}
+
+static inline size_t nvme_tcp_req_cur_length(struct nvme_tcp_request *req)
+{
+       return min_t(size_t, req->snd.iter.bvec->bv_len - req->snd.iter.iov_offset,
+                       req->pdu_len - req->pdu_sent);
+}
+
+static inline size_t nvme_tcp_req_offset(struct nvme_tcp_request *req)
+{
+       return req->snd.iter.iov_offset;
+}
+
+static inline size_t nvme_tcp_pdu_data_left(struct nvme_tcp_request *req)
+{
+       return rq_data_dir(blk_mq_rq_from_pdu(req)) == WRITE ?
+                       req->pdu_len - req->pdu_sent : 0;
+}
+
+static inline size_t nvme_tcp_pdu_last_send(struct nvme_tcp_request *req,
+               int len)
+{
+       return nvme_tcp_pdu_data_left(req) <= len;
+}
+
+static void nvme_tcp_init_send_iter(struct nvme_tcp_request *req)
+{
+       struct request *rq = blk_mq_rq_from_pdu(req);
+       struct bio_vec *vec;
+       unsigned int size;
+       int nsegs;
+       size_t offset;
+
+       if (rq->rq_flags & RQF_SPECIAL_PAYLOAD) {
+               vec = &rq->special_vec;
+               nsegs = 1;
+               size = blk_rq_payload_bytes(rq);
+               offset = 0;
+       } else {
+               struct bio *bio = req->snd.curr_bio;
+
+               vec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+               nsegs = bio_segments(bio);
+               size = bio->bi_iter.bi_size;
+               offset = bio->bi_iter.bi_bvec_done;
+       }
+
+       iov_iter_bvec(&req->snd.iter, WRITE, vec, nsegs, size);
+       req->snd.iter.iov_offset = offset;
+}
+
+static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
+               int len)
+{
+       req->snd.data_sent += len;
+       req->pdu_sent += len;
+       iov_iter_advance(&req->snd.iter, len);
+       if (!iov_iter_count(&req->snd.iter) &&
+           req->snd.data_sent < req->data_len) {
+               req->snd.curr_bio = req->snd.curr_bio->bi_next;
+               nvme_tcp_init_send_iter(req);
+       }
+}
+
+static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req)
+{
+       struct nvme_tcp_queue *queue = req->queue;
+
+       spin_lock_bh(&queue->lock);
+       list_add_tail(&req->entry, &queue->send_list);
+       spin_unlock_bh(&queue->lock);
+
+       queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+}
+
+static inline struct nvme_tcp_request *
+nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
+{
+       struct nvme_tcp_request *req;
+
+       spin_lock_bh(&queue->lock);
+       req = list_first_entry_or_null(&queue->send_list,
+                       struct nvme_tcp_request, entry);
+       if (req)
+               list_del(&req->entry);
+       spin_unlock_bh(&queue->lock);
+
+       return req;
+}
+
+static inline void nvme_tcp_ddgst_final(struct ahash_request *hash, u32 *dgst)
+{
+       ahash_request_set_crypt(hash, NULL, (u8 *)dgst, 0);
+       crypto_ahash_final(hash);
+}
+
+static inline void nvme_tcp_ddgst_update(struct ahash_request *hash,
+               struct page *page, off_t off, size_t len)
+{
+       struct scatterlist sg;
+
+       sg_init_marker(&sg, 1);
+       sg_set_page(&sg, page, len, off);
+       ahash_request_set_crypt(hash, &sg, NULL, len);
+       crypto_ahash_update(hash);
+}
+
+static inline void nvme_tcp_hdgst(struct ahash_request *hash,
+               void *pdu, size_t len)
+{
+       struct scatterlist sg;
+
+       sg_init_one(&sg, pdu, len);
+       ahash_request_set_crypt(hash, &sg, pdu + len, len);
+       crypto_ahash_digest(hash);
+}
+
+static int nvme_tcp_verify_hdgst(struct nvme_tcp_queue *queue,
+       void *pdu, size_t pdu_len)
+{
+       struct nvme_tcp_hdr *hdr = pdu;
+       __le32 recv_digest;
+       __le32 exp_digest;
+
+       if (unlikely(!(hdr->flags & NVME_TCP_F_HDGST))) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "queue %d: header digest flag is cleared\n",
+                       nvme_tcp_queue_id(queue));
+               return -EPROTO;
+       }
+
+       recv_digest = *(__le32 *)(pdu + hdr->hlen);
+       nvme_tcp_hdgst(queue->rcv_hash, pdu, pdu_len);
+       exp_digest = *(__le32 *)(pdu + hdr->hlen);
+       if (recv_digest != exp_digest) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "header digest error: recv %#x expected %#x\n",
+                       le32_to_cpu(recv_digest), le32_to_cpu(exp_digest));
+               return -EIO;
+       }
+
+       return 0;
+}
+
+static int nvme_tcp_check_ddgst(struct nvme_tcp_queue *queue, void *pdu)
+{
+       struct nvme_tcp_hdr *hdr = pdu;
+       u32 len;
+
+       len = le32_to_cpu(hdr->plen) - hdr->hlen -
+               ((hdr->flags & NVME_TCP_F_HDGST) ? nvme_tcp_hdgst_len(queue) : 0);
+
+       if (unlikely(len && !(hdr->flags & NVME_TCP_F_DDGST))) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "queue %d: data digest flag is cleared\n",
+               nvme_tcp_queue_id(queue));
+               return -EPROTO;
+       }
+       crypto_ahash_init(queue->rcv_hash);
+
+       return 0;
+}
+
+static void nvme_tcp_exit_request(struct blk_mq_tag_set *set,
+               struct request *rq, unsigned int hctx_idx)
+{
+       struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+
+       page_frag_free(req->pdu);
+}
+
+static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
+               struct request *rq, unsigned int hctx_idx,
+               unsigned int numa_node)
+{
+       struct nvme_tcp_ctrl *ctrl = set->driver_data;
+       struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+       int queue_idx = (set == &ctrl->tag_set) ? hctx_idx + 1 : 0;
+       struct nvme_tcp_queue *queue = &ctrl->queues[queue_idx];
+       u8 hdgst = nvme_tcp_hdgst_len(queue);
+
+       req->pdu = page_frag_alloc(&queue->pf_cache,
+               sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
+               GFP_KERNEL | __GFP_ZERO);
+       if (!req->pdu)
+               return -ENOMEM;
+
+       req->queue = queue;
+       nvme_req(rq)->ctrl = &ctrl->ctrl;
+
+       return 0;
+}
+
+static int nvme_tcp_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+               unsigned int hctx_idx)
+{
+       struct nvme_tcp_ctrl *ctrl = data;
+       struct nvme_tcp_queue *queue = &ctrl->queues[hctx_idx + 1];
+
+       BUG_ON(hctx_idx >= ctrl->ctrl.queue_count);
+
+       hctx->driver_data = queue;
+       return 0;
+}
+
+static int nvme_tcp_init_admin_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+               unsigned int hctx_idx)
+{
+       struct nvme_tcp_ctrl *ctrl = data;
+       struct nvme_tcp_queue *queue = &ctrl->queues[0];
+
+       BUG_ON(hctx_idx != 0);
+
+       hctx->driver_data = queue;
+       return 0;
+}
+
+static enum nvme_tcp_recv_state nvme_tcp_recv_state(struct nvme_tcp_queue *queue)
+{
+       return  (queue->rcv.pdu_remaining) ? NVME_TCP_RECV_PDU :
+               (queue->rcv.ddgst_remaining) ? NVME_TCP_RECV_DDGST :
+               NVME_TCP_RECV_DATA;
+}
+
+static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
+{
+       struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+
+       rcv->pdu_remaining = sizeof(struct nvme_tcp_rsp_pdu) +
+                               nvme_tcp_hdgst_len(queue);
+       rcv->pdu_offset = 0;
+       rcv->data_remaining = -1;
+       rcv->ddgst_remaining = 0;
+}
+
+void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
+{
+       if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+               return;
+
+       queue_work(nvme_wq, &ctrl->err_work);
+}
+
+static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
+               struct nvme_completion *cqe)
+{
+       struct request *rq;
+       struct nvme_tcp_request *req;
+
+       rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
+       if (!rq) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "queue %d tag 0x%x not found\n",
+                       nvme_tcp_queue_id(queue), cqe->command_id);
+               nvme_tcp_error_recovery(&queue->ctrl->ctrl);
+               return -EINVAL;
+       }
+       req = blk_mq_rq_to_pdu(rq);
+
+       nvme_end_request(rq, cqe->status, cqe->result);
+
+       return 0;
+}
+
+static int nvme_tcp_handle_c2h_data(struct nvme_tcp_queue *queue,
+               struct nvme_tcp_data_pdu *pdu)
+{
+       struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+       struct nvme_tcp_request *req;
+       struct request *rq;
+
+       rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
+       if (!rq) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "queue %d tag %#x not found\n",
+                       nvme_tcp_queue_id(queue), pdu->command_id);
+               return -ENOENT;
+       }
+       req = blk_mq_rq_to_pdu(rq);
+
+       if (!blk_rq_payload_bytes(rq)) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "queue %d tag %#x unexpected data\n",
+                       nvme_tcp_queue_id(queue), rq->tag);
+               return -EIO;
+       }
+
+       rcv->data_remaining = le32_to_cpu(pdu->data_length);
+       /* No support for out-of-order */
+       WARN_ON(le32_to_cpu(pdu->data_offset));
+
+       return 0;
+
+}
+
+static int nvme_tcp_handle_comp(struct nvme_tcp_queue *queue,
+               struct nvme_tcp_rsp_pdu *pdu)
+{
+       struct nvme_completion *cqe = &pdu->cqe;
+       int ret = 0;
+
+       /*
+        * AEN requests are special as they don't time out and can
+        * survive any kind of queue freeze and often don't respond to
+        * aborts.  We don't even bother to allocate a struct request
+        * for them but rather special case them here.
+        */
+       if (unlikely(nvme_tcp_queue_id(queue) == 0 &&
+           cqe->command_id >= NVME_AQ_BLK_MQ_DEPTH))
+               nvme_complete_async_event(&queue->ctrl->ctrl, cqe->status,
+                               &cqe->result);
+       else
+               ret = nvme_tcp_process_nvme_cqe(queue, cqe);
+
+       return ret;
+}
+
+static int nvme_tcp_setup_h2c_data_pdu(struct nvme_tcp_request *req,
+               struct nvme_tcp_r2t_pdu *pdu)
+{
+       struct nvme_tcp_data_pdu *data = req->pdu;
+       struct nvme_tcp_queue *queue = req->queue;
+       struct request *rq = blk_mq_rq_from_pdu(req);
+       u8 hdgst = nvme_tcp_hdgst_len(queue);
+       u8 ddgst = nvme_tcp_ddgst_len(queue);
+
+       req->pdu_len = le32_to_cpu(pdu->r2t_length);
+       req->pdu_sent = 0;
+
+       if (unlikely(req->snd.data_sent + req->pdu_len > req->data_len)) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "req %d r2t length %u exceeded data length %u (%zu sent)\n",
+                       rq->tag, req->pdu_len, req->data_len,
+                       req->snd.data_sent);
+               return -EPROTO;
+       }
+
+       if (unlikely(le32_to_cpu(pdu->r2t_offset) < req->snd.data_sent)) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "req %d unexpected r2t offset %u (expected %zu)\n",
+                       rq->tag, le32_to_cpu(pdu->r2t_offset),
+                       req->snd.data_sent);
+               return -EPROTO;
+       }
+
+       memset(data, 0, sizeof(*data));
+       data->hdr.type = nvme_tcp_h2c_data;
+       data->hdr.flags = NVME_TCP_F_DATA_LAST;
+       if (queue->hdr_digest)
+               data->hdr.flags |= NVME_TCP_F_HDGST;
+       if (queue->data_digest)
+               data->hdr.flags |= NVME_TCP_F_DDGST;
+       data->hdr.hlen = sizeof(*data);
+       data->hdr.pdo = data->hdr.hlen + hdgst;
+       data->hdr.plen =
+               cpu_to_le32(data->hdr.hlen + hdgst + req->pdu_len + ddgst);
+       data->ttag = pdu->ttag;
+       data->command_id = rq->tag;
+       data->data_offset = cpu_to_le32(req->snd.data_sent);
+       data->data_length = cpu_to_le32(req->pdu_len);
+       return 0;
+}
+
+static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
+               struct nvme_tcp_r2t_pdu *pdu)
+{
+       struct nvme_tcp_request *req;
+       struct request *rq;
+       int ret;
+
+       rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
+       if (!rq) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "queue %d tag %#x not found\n",
+                       nvme_tcp_queue_id(queue), pdu->command_id);
+               return -ENOENT;
+       }
+       req = blk_mq_rq_to_pdu(rq);
+
+       ret = nvme_tcp_setup_h2c_data_pdu(req, pdu);
+       if (unlikely(ret))
+               return ret;
+
+       req->snd.state = NVME_TCP_SEND_H2C_PDU;
+       req->snd.offset = 0;
+
+       nvme_tcp_queue_request(req);
+
+       return 0;
+}
+
+static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
+               unsigned int *offset, size_t *len)
+{
+       struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+       struct nvme_tcp_hdr *hdr;
+       size_t rcv_len = min_t(size_t, *len, rcv->pdu_remaining);
+       int ret;
+
+       ret = skb_copy_bits(skb, *offset, &rcv->pdu[rcv->pdu_offset], rcv_len);
+       if (unlikely(ret))
+               return ret;
+
+       rcv->pdu_remaining -= rcv_len;
+       rcv->pdu_offset += rcv_len;
+       *offset += rcv_len;
+       *len -= rcv_len;
+       if (queue->rcv.pdu_remaining)
+               return 0;
+
+       hdr = (void *)rcv->pdu;
+       if (queue->hdr_digest) {
+               ret = nvme_tcp_verify_hdgst(queue, rcv->pdu, hdr->hlen);
+               if (unlikely(ret))
+                       return ret;
+       }
+
+
+       if (queue->data_digest) {
+               ret = nvme_tcp_check_ddgst(queue, rcv->pdu);
+               if (unlikely(ret))
+                       return ret;
+       }
+
+       switch (hdr->type) {
+       case nvme_tcp_c2h_data:
+               ret = nvme_tcp_handle_c2h_data(queue, (void *)rcv->pdu);
+               break;
+       case nvme_tcp_rsp:
+               nvme_tcp_init_recv_ctx(queue);
+               ret = nvme_tcp_handle_comp(queue, (void *)rcv->pdu);
+               break;
+       case nvme_tcp_r2t:
+               nvme_tcp_init_recv_ctx(queue);
+               ret = nvme_tcp_handle_r2t(queue, (void *)rcv->pdu);
+               break;
+       default:
+               dev_err(queue->ctrl->ctrl.device, "unsupported pdu type (%d)\n",
+                       hdr->type);
+               return -EINVAL;
+       }
+
+       return ret;
+}
+
+static void nvme_tcp_init_recv_iter(struct nvme_tcp_request *req)
+{
+       struct bio *bio = req->rcv.curr_bio;
+       struct bio_vec *vec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+       unsigned int nsegs = bio_segments(bio);
+
+       iov_iter_bvec(&req->rcv.iter, READ, vec, nsegs,
+               bio->bi_iter.bi_size);
+       req->rcv.iter.iov_offset = bio->bi_iter.bi_bvec_done;
+}
+
+static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
+                             unsigned int *offset, size_t *len)
+{
+       struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+       struct nvme_tcp_data_pdu *pdu = (void *)rcv->pdu;
+       struct nvme_tcp_request *req;
+       struct request *rq;
+
+       rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
+       if (!rq) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "queue %d tag %#x not found\n",
+                       nvme_tcp_queue_id(queue), pdu->command_id);
+               return -ENOENT;
+       }
+       req = blk_mq_rq_to_pdu(rq);
+
+       while (true) {
+               int recv_len, ret;
+
+               recv_len = min_t(size_t, *len, rcv->data_remaining);
+               if (!recv_len)
+                       break;
+
+               /*
+                * FIXME: This assumes that data comes in-order,
+                *  need to handle the out-of-order case.
+                */
+               if (!iov_iter_count(&req->rcv.iter)) {
+                       req->rcv.curr_bio = req->rcv.curr_bio->bi_next;
+
+                       /*
+                        * If we don't have any bios it means that the controller
+                        * sent more data than we requested, hence this is an error.
+                        */
+                       if (!req->rcv.curr_bio) {
+                               dev_err(queue->ctrl->ctrl.device,
+                                       "queue %d no space in request %#x",
+                                       nvme_tcp_queue_id(queue), rq->tag);
+                               nvme_tcp_init_recv_ctx(queue);
+                               return -EIO;
+                       }
+                       nvme_tcp_init_recv_iter(req);
+               }
+
+               /* we can read only from what is left in this bio */
+               recv_len = min_t(size_t, recv_len,
+                               iov_iter_count(&req->rcv.iter));
+
+               if (queue->data_digest)
+                       ret = skb_copy_and_hash_datagram_iter(skb, *offset,
+                               &req->rcv.iter, recv_len, queue->rcv_hash);
+               else
+                       ret = skb_copy_datagram_iter(skb, *offset,
+                                       &req->rcv.iter, recv_len);
+               if (ret) {
+                       dev_err(queue->ctrl->ctrl.device,
+                               "queue %d failed to copy request %#x data",
+                               nvme_tcp_queue_id(queue), rq->tag);
+                       return ret;
+               }
+
+               *len -= recv_len;
+               *offset += recv_len;
+               rcv->data_remaining -= recv_len;
+       }
+
+       if (!rcv->data_remaining) {
+               if (queue->data_digest) {
+                       nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
+                       rcv->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
+               } else {
+                       nvme_tcp_init_recv_ctx(queue);
+               }
+       }
+
+       return 0;
+}
+
+static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
+               struct sk_buff *skb, unsigned int *offset, size_t *len)
+{
+       struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+       char *ddgst = (char *)&queue->recv_ddgst;
+       size_t recv_len = min_t(size_t, *len, rcv->ddgst_remaining);
+       off_t off = NVME_TCP_DIGEST_LENGTH - rcv->ddgst_remaining;
+       int ret;
+
+       ret = skb_copy_bits(skb, *offset, &ddgst[off], recv_len);
+       if (unlikely(ret))
+               return ret;
+
+       rcv->ddgst_remaining -= recv_len;
+       *offset += recv_len;
+       *len -= recv_len;
+       if (rcv->ddgst_remaining)
+               return 0;
+
+       if (queue->recv_ddgst != queue->exp_ddgst) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "data digest error: recv %#x expected %#x\n",
+                       le32_to_cpu(queue->recv_ddgst),
+                       le32_to_cpu(queue->exp_ddgst));
+               return -EIO;
+       }
+
+       nvme_tcp_init_recv_ctx(queue);
+       return 0;
+}
+
+static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
+                            unsigned int offset, size_t len)
+{
+       struct nvme_tcp_queue *queue = desc->arg.data;
+       size_t consumed = len;
+       int result;
+
+       while (len) {
+               switch (nvme_tcp_recv_state(queue)) {
+               case NVME_TCP_RECV_PDU:
+                       result = nvme_tcp_recv_pdu(queue, skb, &offset, &len);
+                       break;
+               case NVME_TCP_RECV_DATA:
+                       result = nvme_tcp_recv_data(queue, skb, &offset, &len);
+                       break;
+               case NVME_TCP_RECV_DDGST:
+                       result = nvme_tcp_recv_ddgst(queue, skb, &offset, &len);
+                       break;
+               default:
+                       result = -EFAULT;
+               }
+               if (result) {
+                       dev_err(queue->ctrl->ctrl.device,
+                               "receive failed:  %d\n", result);
+                       queue->rd_enabled = false;
+                       nvme_tcp_error_recovery(&queue->ctrl->ctrl);
+                       return result;
+               }
+       }
+
+       return consumed;
+}
+
+static void nvme_tcp_data_ready(struct sock *sk)
+{
+       struct nvme_tcp_queue *queue;
+
+       read_lock(&sk->sk_callback_lock);
+       queue = sk->sk_user_data;
+       if (unlikely(!queue || !queue->rd_enabled))
+               goto done;
+
+       queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+done:
+       read_unlock(&sk->sk_callback_lock);
+}
+
+static void nvme_tcp_write_space(struct sock *sk)
+{
+       struct nvme_tcp_queue *queue;
+
+       read_lock_bh(&sk->sk_callback_lock);
+       queue = sk->sk_user_data;
+
+       if (!queue)
+               goto done;
+
+       if (sk_stream_is_writeable(sk)) {
+               clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+               queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+       }
+done:
+       read_unlock_bh(&sk->sk_callback_lock);
+}
+
+static void nvme_tcp_state_change(struct sock *sk)
+{
+       struct nvme_tcp_queue *queue;
+
+       read_lock(&sk->sk_callback_lock);
+       queue = sk->sk_user_data;
+       if (!queue)
+               goto done;
+
+       switch (sk->sk_state) {
+       case TCP_CLOSE:
+       case TCP_CLOSE_WAIT:
+       case TCP_LAST_ACK:
+       case TCP_FIN_WAIT1:
+       case TCP_FIN_WAIT2:
+               /* fallthrough */
+               nvme_tcp_error_recovery(&queue->ctrl->ctrl);
+               break;
+       default:
+               dev_info(queue->ctrl->ctrl.device,
+                       "queue %d socket state %d\n",
+                       nvme_tcp_queue_id(queue), sk->sk_state);
+       }
+
+       queue->sc(sk);
+done:
+       read_unlock(&sk->sk_callback_lock);
+}
+
+static inline void nvme_tcp_done_send_req(struct nvme_tcp_queue *queue)
+{
+       queue->request = NULL;
+}
+
+static void nvme_tcp_fail_request(struct nvme_tcp_request *req)
+{
+       union nvme_result res = {};
+
+       nvme_end_request(blk_mq_rq_from_pdu(req),
+               NVME_SC_DATA_XFER_ERROR, res);
+}
+
+static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
+{
+       struct nvme_tcp_queue *queue = req->queue;
+
+       while (true) {
+               struct page *page = nvme_tcp_req_cur_page(req);
+               size_t offset = nvme_tcp_req_cur_offset(req);
+               size_t len = nvme_tcp_req_cur_length(req);
+               bool last = nvme_tcp_pdu_last_send(req, len);
+               int ret, flags = MSG_DONTWAIT;
+
+               if (last && !queue->data_digest)
+                       flags |= MSG_EOR;
+               else
+                       flags |= MSG_MORE;
+
+               ret = kernel_sendpage(queue->sock, page, offset, len, flags);
+               if (ret <= 0)
+                       return ret;
+
+               nvme_tcp_advance_req(req, ret);
+               if (queue->data_digest)
+                       nvme_tcp_ddgst_update(queue->snd_hash, page, offset, ret);
+
+               /* fully successful last write */
+               if (last && ret == len) {
+                       if (queue->data_digest) {
+                               nvme_tcp_ddgst_final(queue->snd_hash,
+                                       &req->ddgst);
+                               req->snd.state = NVME_TCP_SEND_DDGST;
+                               req->snd.offset = 0;
+                       } else {
+                               nvme_tcp_done_send_req(queue);
+                       }
+                       return 1;
+               }
+       }
+       return -EAGAIN;
+}
+
+static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
+{
+       struct nvme_tcp_queue *queue = req->queue;
+       struct nvme_tcp_send_ctx *snd = &req->snd;
+       struct nvme_tcp_cmd_pdu *pdu = req->pdu;
+       bool inline_data = nvme_tcp_has_inline_data(req);
+       int flags = MSG_DONTWAIT | (inline_data ? MSG_MORE : MSG_EOR);
+       u8 hdgst = nvme_tcp_hdgst_len(queue);
+       int len = sizeof(*pdu) + hdgst - snd->offset;
+       int ret;
+
+       if (queue->hdr_digest && !snd->offset)
+               nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
+
+       ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
+                       offset_in_page(pdu) + snd->offset, len,  flags);
+       if (unlikely(ret <= 0))
+               return ret;
+
+       len -= ret;
+       if (!len) {
+               if (inline_data) {
+                       req->snd.state = NVME_TCP_SEND_DATA;
+                       if (queue->data_digest)
+                               crypto_ahash_init(queue->snd_hash);
+                       nvme_tcp_init_send_iter(req);
+               } else {
+                       nvme_tcp_done_send_req(queue);
+               }
+               return 1;
+       }
+       snd->offset += ret;
+
+       return -EAGAIN;
+}
+
+static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req)
+{
+       struct nvme_tcp_queue *queue = req->queue;
+       struct nvme_tcp_send_ctx *snd = &req->snd;
+       struct nvme_tcp_data_pdu *pdu = req->pdu;
+       u8 hdgst = nvme_tcp_hdgst_len(queue);
+       int len = sizeof(*pdu) - snd->offset + hdgst;
+       int ret;
+
+       if (queue->hdr_digest && !snd->offset)
+               nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
+
+       ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
+                       offset_in_page(pdu) + snd->offset, len,
+                       MSG_DONTWAIT | MSG_MORE);
+       if (unlikely(ret <= 0))
+               return ret;
+
+       len -= ret;
+       if (!len) {
+               req->snd.state = NVME_TCP_SEND_DATA;
+               if (queue->data_digest)
+                       crypto_ahash_init(queue->snd_hash);
+               if (!req->snd.data_sent)
+                       nvme_tcp_init_send_iter(req);
+               return 1;
+       }
+       snd->offset += ret;
+
+       return -EAGAIN;
+}
+
+static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req)
+{
+       struct nvme_tcp_queue *queue = req->queue;
+       int ret;
+       struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_EOR };
+       struct kvec iov = {
+               .iov_base = &req->ddgst + req->snd.offset,
+               .iov_len = NVME_TCP_DIGEST_LENGTH - req->snd.offset
+       };
+
+       ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
+       if (unlikely(ret <= 0))
+               return ret;
+
+       if (req->snd.offset + ret == NVME_TCP_DIGEST_LENGTH) {
+               nvme_tcp_done_send_req(queue);
+               return 1;
+       }
+
+       req->snd.offset += ret;
+       return -EAGAIN;
+}
+
+static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
+{
+       struct nvme_tcp_request *req;
+       int ret = 1;
+
+       if (!queue->request) {
+               queue->request = nvme_tcp_fetch_request(queue);
+               if (!queue->request)
+                       return 0;
+       }
+       req = queue->request;
+
+       if (req->snd.state == NVME_TCP_SEND_CMD_PDU) {
+               ret = nvme_tcp_try_send_cmd_pdu(req);
+               if (ret <= 0)
+                       goto done;
+               if (!nvme_tcp_has_inline_data(req))
+                       return ret;
+       }
+
+       if (req->snd.state == NVME_TCP_SEND_H2C_PDU) {
+               ret = nvme_tcp_try_send_data_pdu(req);
+               if (ret <= 0)
+                       goto done;
+       }
+
+       if (req->snd.state == NVME_TCP_SEND_DATA) {
+               ret = nvme_tcp_try_send_data(req);
+               if (ret <= 0)
+                       goto done;
+       }
+
+       if (req->snd.state == NVME_TCP_SEND_DDGST)
+               ret = nvme_tcp_try_send_ddgst(req);
+done:
+       if (ret == -EAGAIN)
+               ret = 0;
+       return ret;
+}
+
+static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
+{
+       struct sock *sk = queue->sock->sk;
+       read_descriptor_t rd_desc;
+       int consumed;
+
+       rd_desc.arg.data = queue;
+       rd_desc.count = 1;
+       lock_sock(sk);
+       consumed = tcp_read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
+       release_sock(sk);
+       return consumed;
+}
+
+static void nvme_tcp_io_work(struct work_struct *w)
+{
+       struct nvme_tcp_queue *queue =
+               container_of(w, struct nvme_tcp_queue, io_work);
+       unsigned long start = jiffies + msecs_to_jiffies(1);
+
+       do {
+               bool pending = false;
+               int result;
+
+               result = nvme_tcp_try_send(queue);
+               if (result > 0) {
+                       pending = true;
+               } else if (unlikely(result < 0)) {
+                       dev_err(queue->ctrl->ctrl.device,
+                               "failed to send request %d\n", result);
+                       if (result != -EPIPE)
+                               nvme_tcp_fail_request(queue->request);
+                       nvme_tcp_done_send_req(queue);
+                       return;
+               }
+
+               result = nvme_tcp_try_recv(queue);
+               if (result > 0)
+                       pending = true;
+
+               if (!pending)
+                       return;
+
+       } while (!time_after(jiffies, start)); /* quota is exhausted */
+
+       queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+}
+
+static void nvme_tcp_free_crypto(struct nvme_tcp_queue *queue)
+{
+       struct crypto_ahash *tfm = crypto_ahash_reqtfm(queue->rcv_hash);
+
+       ahash_request_free(queue->rcv_hash);
+       ahash_request_free(queue->snd_hash);
+       crypto_free_ahash(tfm);
+}
+
+static int nvme_tcp_alloc_crypto(struct nvme_tcp_queue *queue)
+{
+       struct crypto_ahash *tfm;
+
+       tfm = crypto_alloc_ahash("crc32c", 0, CRYPTO_ALG_ASYNC);
+       if (IS_ERR(tfm))
+               return PTR_ERR(tfm);
+
+       queue->snd_hash = ahash_request_alloc(tfm, GFP_KERNEL);
+       if (!queue->snd_hash)
+               goto free_tfm;
+       ahash_request_set_callback(queue->snd_hash, 0, NULL, NULL);
+
+       queue->rcv_hash = ahash_request_alloc(tfm, GFP_KERNEL);
+       if (!queue->rcv_hash)
+               goto free_snd_hash;
+       ahash_request_set_callback(queue->rcv_hash, 0, NULL, NULL);
+
+       return 0;
+free_snd_hash:
+       ahash_request_free(queue->snd_hash);
+free_tfm:
+       crypto_free_ahash(tfm);
+       return -ENOMEM;
+}
+
+static void nvme_tcp_free_async_req(struct nvme_tcp_ctrl *ctrl)
+{
+       struct nvme_tcp_request *async = &ctrl->async_req;
+
+       page_frag_free(async->pdu);
+}
+
+static int nvme_tcp_alloc_async_req(struct nvme_tcp_ctrl *ctrl)
+{
+       struct nvme_tcp_queue *queue = &ctrl->queues[0];
+       struct nvme_tcp_request *async = &ctrl->async_req;
+       u8 hdgst = nvme_tcp_hdgst_len(queue);
+
+       async->pdu = page_frag_alloc(&queue->pf_cache,
+               sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
+               GFP_KERNEL | __GFP_ZERO);
+       if (!async->pdu)
+               return -ENOMEM;
+
+       async->queue = &ctrl->queues[0];
+       return 0;
+}
+
+static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
+{
+       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+       struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+
+       if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
+               return;
+
+       if (queue->hdr_digest || queue->data_digest)
+               nvme_tcp_free_crypto(queue);
+
+       sock_release(queue->sock);
+       kfree(queue->rcv.pdu);
+}
+
+static int nvme_tcp_init_connection(struct nvme_tcp_queue *queue)
+{
+       struct nvme_tcp_icreq_pdu *icreq;
+       struct nvme_tcp_icresp_pdu *icresp;
+       struct msghdr msg = {};
+       struct kvec iov;
+       bool ctrl_hdgst, ctrl_ddgst;
+       int ret;
+
+       icreq = kzalloc(sizeof(*icreq), GFP_KERNEL);
+       if (!icreq)
+               return -ENOMEM;
+
+       icresp = kzalloc(sizeof(*icresp), GFP_KERNEL);
+       if (!icresp) {
+               ret = -ENOMEM;
+               goto free_icreq;
+       }
+
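+       /*
+        * Build the ICReq PDU: protocol version, R2T limit, PDU alignment
+        * and the requested digest options.
+        */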
+       icreq->hdr.type = nvme_tcp_icreq;
+       icreq->hdr.hlen = sizeof(*icreq);
+       icreq->hdr.pdo = 0;
+       icreq->hdr.plen = cpu_to_le32(icreq->hdr.hlen);
+       icreq->pfv = cpu_to_le16(NVME_TCP_PFV_1_0);
+       icreq->maxr2t = cpu_to_le16(1); /* single inflight r2t supported */
+       icreq->hpda = 0; /* no alignment constraint */
+       if (queue->hdr_digest)
+               icreq->digest |= NVME_TCP_HDR_DIGEST_ENABLE;
+       if (queue->data_digest)
+               icreq->digest |= NVME_TCP_DATA_DIGEST_ENABLE;
+
+       iov.iov_base = icreq;
+       iov.iov_len = sizeof(*icreq);
+       ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
+       if (ret < 0)
+               goto free_icresp;
+
+       memset(&msg, 0, sizeof(msg));
+       iov.iov_base = icresp;
+       iov.iov_len = sizeof(*icresp);
+       ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
+                       iov.iov_len, msg.msg_flags);
+       if (ret < 0)
+               goto free_icresp;
+
+       ret = -EINVAL;
+       if (icresp->hdr.type != nvme_tcp_icresp) {
+               pr_err("queue %d: bad type returned %d\n",
+                       nvme_tcp_queue_id(queue), icresp->hdr.type);
+               goto free_icresp;
+       }
+
+       if (le32_to_cpu(icresp->hdr.plen) != sizeof(*icresp)) {
+               pr_err("queue %d: bad pdu length returned %d\n",
+                       nvme_tcp_queue_id(queue), icresp->hdr.plen);
+               goto free_icresp;
+       }
+
+       if (icresp->pfv != NVME_TCP_PFV_1_0) {
+               pr_err("queue %d: bad pfv returned %d\n",
+                       nvme_tcp_queue_id(queue), icresp->pfv);
+               goto free_icresp;
+       }
+
+       ctrl_ddgst = !!(icresp->digest & NVME_TCP_DATA_DIGEST_ENABLE);
+       if ((queue->data_digest && !ctrl_ddgst) ||
+           (!queue->data_digest && ctrl_ddgst)) {
+               pr_err("queue %d: data digest mismatch host: %s ctrl: %s\n",
+                       nvme_tcp_queue_id(queue),
+                       queue->data_digest ? "enabled" : "disabled",
+                       ctrl_ddgst ? "enabled" : "disabled");
+               goto free_icresp;
+       }
+
+       ctrl_hdgst = !!(icresp->digest & NVME_TCP_HDR_DIGEST_ENABLE);
+       if ((queue->hdr_digest && !ctrl_hdgst) ||
+           (!queue->hdr_digest && ctrl_hdgst)) {
+               pr_err("queue %d: header digest mismatch host: %s ctrl: %s\n",
+                       nvme_tcp_queue_id(queue),
+                       queue->hdr_digest ? "enabled" : "disabled",
+                       ctrl_hdgst ? "enabled" : "disabled");
+               goto free_icresp;
+       }
+
+       if (icresp->cpda != 0) {
+               pr_err("queue %d: unsupported cpda returned %d\n",
+                       nvme_tcp_queue_id(queue), icresp->cpda);
+               goto free_icresp;
+       }
+
+       ret = 0;
+free_icresp:
+       kfree(icresp);
+free_icreq:
+       kfree(icreq);
+       return ret;
+}
+
+static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
+               int qid, size_t queue_size)
+{
+       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+       struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+       struct linger sol = { .l_onoff = 1, .l_linger = 0 };
+       int ret, opt, rcv_pdu_size;
+
+       queue->ctrl = ctrl;
+       INIT_LIST_HEAD(&queue->send_list);
+       spin_lock_init(&queue->lock);
+       INIT_WORK(&queue->io_work, nvme_tcp_io_work);
+       queue->queue_size = queue_size;
+
+       if (qid > 0)
+               queue->cmnd_capsule_len = ctrl->ctrl.ioccsz * 16;
+       else
+               queue->cmnd_capsule_len = sizeof(struct nvme_command) +
+                                               NVME_TCP_ADMIN_CCSZ;
+
+       ret = sock_create(ctrl->addr.ss_family, SOCK_STREAM,
+                       IPPROTO_TCP, &queue->sock);
+       if (ret) {
+               dev_err(ctrl->ctrl.device,
+                       "failed to create socket: %d\n", ret);
+               return ret;
+       }
+
+       /* Single syn retry */
+       opt = 1;
+       ret = kernel_setsockopt(queue->sock, IPPROTO_TCP, TCP_SYNCNT,
+                       (char *)&opt, sizeof(opt));
+       if (ret) {
+               dev_err(ctrl->ctrl.device,
+                       "failed to set TCP_SYNCNT sock opt %d\n", ret);
+               goto err_sock;
+       }
+
+       /* Set TCP no delay */
+       opt = 1;
+       ret = kernel_setsockopt(queue->sock, IPPROTO_TCP,
+                       TCP_NODELAY, (char *)&opt, sizeof(opt));
+       if (ret) {
+               dev_err(ctrl->ctrl.device,
+                       "failed to set TCP_NODELAY sock opt %d\n", ret);
+               goto err_sock;
+       }
+
+       /*
+        * Cleanup whatever is sitting in the TCP transmit queue on socket
+        * close. This is done to prevent stale data from being sent should
+        * the network connection be restored before TCP times out.
+        */
+       ret = kernel_setsockopt(queue->sock, SOL_SOCKET, SO_LINGER,
+                       (char *)&sol, sizeof(sol));
+       if (ret) {
+               dev_err(ctrl->ctrl.device,
+                       "failed to set SO_LINGER sock opt %d\n", ret);
+               goto err_sock;
+       }
+
+       queue->sock->sk->sk_allocation = GFP_ATOMIC;
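+       /*
+        * Admin queue (qid 0) runs its io_work on CPU 0, I/O queue qid
+        * runs on CPU qid - 1.
+        */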
+       queue->io_cpu = (qid == 0) ? 0 : qid - 1;
+       queue->request = NULL;
+       queue->rcv.data_remaining = 0;
+       queue->rcv.ddgst_remaining = 0;
+       queue->rcv.pdu_remaining = 0;
+       queue->rcv.pdu_offset = 0;
+       sk_set_memalloc(queue->sock->sk);
+
+       if (ctrl->ctrl.opts->mask & NVMF_OPT_HOST_TRADDR) {
+               ret = kernel_bind(queue->sock, (struct sockaddr *)&ctrl->src_addr,
+                       sizeof(ctrl->src_addr));
+               if (ret) {
+                       dev_err(ctrl->ctrl.device,
+                               "failed to bind queue %d socket %d\n",
+                               qid, ret);
+                       goto err_sock;
+               }
+       }
+
+       queue->hdr_digest = nctrl->opts->hdr_digest;
+       queue->data_digest = nctrl->opts->data_digest;
+       if (queue->hdr_digest || queue->data_digest) {
+               ret = nvme_tcp_alloc_crypto(queue);
+               if (ret) {
+                       dev_err(ctrl->ctrl.device,
+                               "failed to allocate queue %d crypto\n", qid);
+                       goto err_sock;
+               }
+       }
+
+       rcv_pdu_size = sizeof(struct nvme_tcp_rsp_pdu) +
+                       nvme_tcp_hdgst_len(queue);
+       queue->rcv.pdu = kmalloc(rcv_pdu_size, GFP_KERNEL);
+       if (!queue->rcv.pdu) {
+               ret = -ENOMEM;
+               goto err_crypto;
+       }
+
+       dev_dbg(ctrl->ctrl.device, "connecting queue %d\n",
+                       nvme_tcp_queue_id(queue));
+
+       ret = kernel_connect(queue->sock, (struct sockaddr *)&ctrl->addr,
+               sizeof(ctrl->addr), 0);
+       if (ret) {
+               dev_err(ctrl->ctrl.device,
+                       "failed to connect socket: %d\n", ret);
+               goto err_rcv_pdu;
+       }
+
+       ret = nvme_tcp_init_connection(queue);
+       if (ret)
+               goto err_init_connect;
+
+       queue->rd_enabled = true;
+       set_bit(NVME_TCP_Q_ALLOCATED, &queue->flags);
+       nvme_tcp_init_recv_ctx(queue);
+
+       write_lock_bh(&queue->sock->sk->sk_callback_lock);
+       queue->sock->sk->sk_user_data = queue;
+       queue->sc = queue->sock->sk->sk_state_change;
+       queue->dr = queue->sock->sk->sk_data_ready;
+       queue->ws = queue->sock->sk->sk_write_space;
+       queue->sock->sk->sk_data_ready = nvme_tcp_data_ready;
+       queue->sock->sk->sk_state_change = nvme_tcp_state_change;
+       queue->sock->sk->sk_write_space = nvme_tcp_write_space;
+       write_unlock_bh(&queue->sock->sk->sk_callback_lock);
+
+       return 0;
+
+err_init_connect:
+       kernel_sock_shutdown(queue->sock, SHUT_RDWR);
+err_rcv_pdu:
+       kfree(queue->rcv.pdu);
+err_crypto:
+       if (queue->hdr_digest || queue->data_digest)
+               nvme_tcp_free_crypto(queue);
+err_sock:
+       sock_release(queue->sock);
+       queue->sock = NULL;
+       return ret;
+}
+
+static void nvme_tcp_restore_sock_calls(struct nvme_tcp_queue *queue)
+{
+       struct socket *sock = queue->sock;
+
+       write_lock_bh(&sock->sk->sk_callback_lock);
+       sock->sk->sk_user_data  = NULL;
+       sock->sk->sk_data_ready = queue->dr;
+       sock->sk->sk_state_change = queue->sc;
+       sock->sk->sk_write_space  = queue->ws;
+       write_unlock_bh(&sock->sk->sk_callback_lock);
+}
+
+static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
+{
+       kernel_sock_shutdown(queue->sock, SHUT_RDWR);
+       nvme_tcp_restore_sock_calls(queue);
+       cancel_work_sync(&queue->io_work);
+}
+
+static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
+{
+       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+       struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+
+       if (!test_and_clear_bit(NVME_TCP_Q_LIVE, &queue->flags))
+               return;
+
+       __nvme_tcp_stop_queue(queue);
+}
+
+static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
+{
+       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+       int ret;
+
+       if (idx)
+               ret = nvmf_connect_io_queue(nctrl, idx);
+       else
+               ret = nvmf_connect_admin_queue(nctrl);
+
+       if (!ret) {
+               set_bit(NVME_TCP_Q_LIVE, &ctrl->queues[idx].flags);
+       } else {
+               __nvme_tcp_stop_queue(&ctrl->queues[idx]);
+               dev_err(nctrl->device,
+                       "failed to connect queue: %d ret=%d\n", idx, ret);
+       }
+       return ret;
+}
+
+static void nvme_tcp_free_tagset(struct nvme_ctrl *nctrl,
+               struct blk_mq_tag_set *set)
+{
+       blk_mq_free_tag_set(set);
+}
+
+static struct blk_mq_tag_set *nvme_tcp_alloc_tagset(struct nvme_ctrl *nctrl,
+               bool admin)
+{
+       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+       struct blk_mq_tag_set *set;
+       int ret;
+
+       if (admin) {
+               set = &ctrl->admin_tag_set;
+               memset(set, 0, sizeof(*set));
+               set->ops = &nvme_tcp_admin_mq_ops;
+               set->queue_depth = NVME_AQ_MQ_TAG_DEPTH;
+               set->reserved_tags = 2; /* connect + keep-alive */
+               set->numa_node = NUMA_NO_NODE;
+               set->cmd_size = sizeof(struct nvme_tcp_request);
+               set->driver_data = ctrl;
+               set->nr_hw_queues = 1;
+               set->timeout = ADMIN_TIMEOUT;
+       } else {
+               set = &ctrl->tag_set;
+               memset(set, 0, sizeof(*set));
+               set->ops = &nvme_tcp_mq_ops;
+               set->queue_depth = nctrl->sqsize + 1;
+               set->reserved_tags = 1; /* fabric connect */
+               set->numa_node = NUMA_NO_NODE;
+               set->flags = BLK_MQ_F_SHOULD_MERGE;
+               set->cmd_size = sizeof(struct nvme_tcp_request);
+               set->driver_data = ctrl;
+               set->nr_hw_queues = nctrl->queue_count - 1;
+               set->timeout = NVME_IO_TIMEOUT;
+       }
+
+       ret = blk_mq_alloc_tag_set(set);
+       if (ret)
+               return ERR_PTR(ret);
+
+       return set;
+}
+
+static void nvme_tcp_free_admin_queue(struct nvme_ctrl *ctrl)
+{
+       if (to_tcp_ctrl(ctrl)->async_req.pdu) {
+               nvme_tcp_free_async_req(to_tcp_ctrl(ctrl));
+               to_tcp_ctrl(ctrl)->async_req.pdu = NULL;
+       }
+
+       nvme_tcp_free_queue(ctrl, 0);
+}
+
+static void nvme_tcp_free_io_queues(struct nvme_ctrl *ctrl)
+{
+       int i;
+
+       for (i = 1; i < ctrl->queue_count; i++)
+               nvme_tcp_free_queue(ctrl, i);
+}
+
+static void nvme_tcp_stop_admin_queue(struct nvme_ctrl *ctrl)
+{
+       nvme_tcp_stop_queue(ctrl, 0);
+}
+
+static void nvme_tcp_stop_io_queues(struct nvme_ctrl *ctrl)
+{
+       int i;
+
+       for (i = 1; i < ctrl->queue_count; i++)
+               nvme_tcp_stop_queue(ctrl, i);
+}
+
+static int nvme_tcp_start_admin_queue(struct nvme_ctrl *ctrl)
+{
+       return nvme_tcp_start_queue(ctrl, 0);
+}
+
+static int nvme_tcp_start_io_queues(struct nvme_ctrl *ctrl)
+{
+       int i, ret = 0;
+
+       for (i = 1; i < ctrl->queue_count; i++) {
+               ret = nvme_tcp_start_queue(ctrl, i);
+               if (ret)
+                       goto out_stop_queues;
+       }
+
+       return 0;
+
+out_stop_queues:
+       for (i--; i >= 1; i--)
+               nvme_tcp_stop_queue(ctrl, i);
+       return ret;
+}
+
+static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
+{
+       int ret;
+
+       ret = nvme_tcp_alloc_queue(ctrl, 0, NVME_AQ_DEPTH);
+       if (ret)
+               return ret;
+
+       ret = nvme_tcp_alloc_async_req(to_tcp_ctrl(ctrl));
+       if (ret)
+               goto out_free_queue;
+
+       return 0;
+
+out_free_queue:
+       nvme_tcp_free_queue(ctrl, 0);
+       return ret;
+}
+
+static int nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
+{
+       int i, ret;
+
+       for (i = 1; i < ctrl->queue_count; i++) {
+               ret = nvme_tcp_alloc_queue(ctrl, i,
+                               ctrl->sqsize + 1);
+               if (ret)
+                       goto out_free_queues;
+       }
+
+       return 0;
+
+out_free_queues:
+       for (i--; i >= 1; i--)
+               nvme_tcp_free_queue(ctrl, i);
+
+       return ret;
+}
+
+static unsigned int nvme_tcp_nr_io_queues(struct nvme_ctrl *ctrl)
+{
+       return min(ctrl->queue_count - 1, num_online_cpus());
+}
+
+static int nvme_alloc_io_queues(struct nvme_ctrl *ctrl)
+{
+       unsigned int nr_io_queues;
+       int ret;
+
+       nr_io_queues = nvme_tcp_nr_io_queues(ctrl);
+       ret = nvme_set_queue_count(ctrl, &nr_io_queues);
+       if (ret)
+               return ret;
+
+       ctrl->queue_count = nr_io_queues + 1;
+       if (ctrl->queue_count < 2)
+               return 0;
+
+       dev_info(ctrl->device,
+               "creating %d I/O queues.\n", nr_io_queues);
+
+       return nvme_tcp_alloc_io_queues(ctrl);
+}
+
+void nvme_tcp_destroy_io_queues(struct nvme_ctrl *ctrl, bool remove)
+{
+       nvme_tcp_stop_io_queues(ctrl);
+       if (remove) {
+               if (ctrl->ops->flags & NVME_F_FABRICS)
+                       blk_cleanup_queue(ctrl->connect_q);
+               nvme_tcp_free_tagset(ctrl, ctrl->tagset);
+       }
+       nvme_tcp_free_io_queues(ctrl);
+}
+
+int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
+{
+       int ret;
+
+       ret = nvme_alloc_io_queues(ctrl);
+       if (ret)
+               return ret;
+
+       if (new) {
+               ctrl->tagset = nvme_tcp_alloc_tagset(ctrl, false);
+               if (IS_ERR(ctrl->tagset)) {
+                       ret = PTR_ERR(ctrl->tagset);
+                       goto out_free_io_queues;
+               }
+
+               if (ctrl->ops->flags & NVME_F_FABRICS) {
+                       ctrl->connect_q = blk_mq_init_queue(ctrl->tagset);
+                       if (IS_ERR(ctrl->connect_q)) {
+                               ret = PTR_ERR(ctrl->connect_q);
+                               goto out_free_tag_set;
+                       }
+               }
+       } else {
+               blk_mq_update_nr_hw_queues(ctrl->tagset,
+                       ctrl->queue_count - 1);
+       }
+
+       ret = nvme_tcp_start_io_queues(ctrl);
+       if (ret)
+               goto out_cleanup_connect_q;
+
+       return 0;
+
+out_cleanup_connect_q:
+       if (new && (ctrl->ops->flags & NVME_F_FABRICS))
+               blk_cleanup_queue(ctrl->connect_q);
+out_free_tag_set:
+       if (new)
+               nvme_tcp_free_tagset(ctrl, ctrl->tagset);
+out_free_io_queues:
+       nvme_tcp_free_io_queues(ctrl);
+       return ret;
+}
+
+void nvme_tcp_destroy_admin_queue(struct nvme_ctrl *ctrl, bool remove)
+{
+       nvme_tcp_stop_admin_queue(ctrl);
+       if (remove) {
+               free_opal_dev(ctrl->opal_dev);
+               blk_cleanup_queue(ctrl->admin_q);
+               nvme_tcp_free_tagset(ctrl, ctrl->admin_tagset);
+       }
+       nvme_tcp_free_admin_queue(ctrl);
+}
+
+int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new)
+{
+       int error;
+
+       error = nvme_tcp_alloc_admin_queue(ctrl);
+       if (error)
+               return error;
+
+       if (new) {
+               ctrl->admin_tagset = nvme_tcp_alloc_tagset(ctrl, true);
+               if (IS_ERR(ctrl->admin_tagset)) {
+                       error = PTR_ERR(ctrl->admin_tagset);
+                       goto out_free_queue;
+               }
+
+               ctrl->admin_q = blk_mq_init_queue(ctrl->admin_tagset);
+               if (IS_ERR(ctrl->admin_q)) {
+                       error = PTR_ERR(ctrl->admin_q);
+                       goto out_free_tagset;
+               }
+       }
+
+       error = nvme_tcp_start_admin_queue(ctrl);
+       if (error)
+               goto out_cleanup_queue;
+
+       error = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &ctrl->cap);
+       if (error) {
+               dev_err(ctrl->device,
+                       "prop_get NVME_REG_CAP failed\n");
+               goto out_stop_queue;
+       }
+
+       ctrl->sqsize = min_t(int, NVME_CAP_MQES(ctrl->cap), ctrl->sqsize);
+
+       error = nvme_enable_ctrl(ctrl, ctrl->cap);
+       if (error)
+               goto out_stop_queue;
+
+       error = nvme_init_identify(ctrl);
+       if (error)
+               goto out_stop_queue;
+
+       return 0;
+
+out_stop_queue:
+       nvme_tcp_stop_admin_queue(ctrl);
+out_cleanup_queue:
+       if (new)
+               blk_cleanup_queue(ctrl->admin_q);
+out_free_tagset:
+       if (new)
+               nvme_tcp_free_tagset(ctrl, ctrl->admin_tagset);
+out_free_queue:
+       nvme_tcp_free_admin_queue(ctrl);
+       return error;
+}
+
+static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
+               bool remove)
+{
+       blk_mq_quiesce_queue(ctrl->admin_q);
+       nvme_tcp_stop_admin_queue(ctrl);
+       blk_mq_tagset_busy_iter(ctrl->admin_tagset, nvme_cancel_request, ctrl);
+       blk_mq_unquiesce_queue(ctrl->admin_q);
+       nvme_tcp_destroy_admin_queue(ctrl, remove);
+}
+
+static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
+               bool remove)
+{
+       if (ctrl->queue_count > 1) {
+               nvme_stop_queues(ctrl);
+               nvme_tcp_stop_io_queues(ctrl);
+               blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl);
+               if (remove)
+                       nvme_start_queues(ctrl);
+               nvme_tcp_destroy_io_queues(ctrl, remove);
+       }
+}
+
+static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl)
+{
+       /* If we are resetting/deleting then do nothing */
+       if (ctrl->state != NVME_CTRL_CONNECTING) {
+               WARN_ON_ONCE(ctrl->state == NVME_CTRL_NEW ||
+                       ctrl->state == NVME_CTRL_LIVE);
+               return;
+       }
+
+       if (nvmf_should_reconnect(ctrl)) {
+               dev_info(ctrl->device, "Reconnecting in %d seconds...\n",
+                       ctrl->opts->reconnect_delay);
+               queue_delayed_work(nvme_wq, &ctrl->connect_work,
+                               ctrl->opts->reconnect_delay * HZ);
+       } else {
+               dev_info(ctrl->device, "Removing controller...\n");
+               nvme_delete_ctrl(ctrl);
+       }
+}
+
+static int nvme_tcp_setup_ctrl(struct nvme_ctrl *ctrl, bool new)
+{
+       struct nvmf_ctrl_options *opts = ctrl->opts;
+       int ret = -EINVAL;
+
+       ret = nvme_tcp_configure_admin_queue(ctrl, new);
+       if (ret)
+               return ret;
+
+       if (ctrl->icdoff) {
+               dev_err(ctrl->device, "icdoff is not supported!\n");
+               goto destroy_admin;
+       }
+
+       if (opts->queue_size > ctrl->sqsize + 1)
+               dev_warn(ctrl->device,
+                       "queue_size %zu > ctrl sqsize %u, clamping down\n",
+                       opts->queue_size, ctrl->sqsize + 1);
+
+       if (ctrl->sqsize + 1 > ctrl->maxcmd) {
+               dev_warn(ctrl->device,
+                       "sqsize %u > ctrl maxcmd %u, clamping down\n",
+                       ctrl->sqsize + 1, ctrl->maxcmd);
+               ctrl->sqsize = ctrl->maxcmd - 1;
+       }
+
+       if (ctrl->queue_count > 1) {
+               ret = nvme_tcp_configure_io_queues(ctrl, new);
+               if (ret)
+                       goto destroy_admin;
+       }
+
+       if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE)) {
+               /* state change failure is ok if we're in DELETING state */
+               WARN_ON_ONCE(ctrl->state != NVME_CTRL_DELETING);
+               ret = -EINVAL;
+               goto destroy_io;
+       }
+
+       nvme_start_ctrl(ctrl);
+       return 0;
+
+destroy_io:
+       if (ctrl->queue_count > 1)
+               nvme_tcp_destroy_io_queues(ctrl, new);
+destroy_admin:
+       nvme_tcp_stop_admin_queue(ctrl);
+       nvme_tcp_destroy_admin_queue(ctrl, new);
+       return ret;
+}
+
+static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
+{
+       struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
+                       struct nvme_ctrl, connect_work);
+
+       ++ctrl->nr_reconnects;
+
+       if (nvme_tcp_setup_ctrl(ctrl, false))
+               goto requeue;
+
+       dev_info(ctrl->device, "Successfully reconnected (%d attempts)\n",
+                       ctrl->nr_reconnects);
+
+       ctrl->nr_reconnects = 0;
+
+       return;
+
+requeue:
+       dev_info(ctrl->device, "Failed reconnect attempt %d\n",
+                       ctrl->nr_reconnects);
+       nvme_tcp_reconnect_or_remove(ctrl);
+}
+
+static void nvme_tcp_error_recovery_work(struct work_struct *work)
+{
+       struct nvme_ctrl *ctrl = container_of(work,
+                       struct nvme_ctrl, err_work);
+
+       nvme_stop_keep_alive(ctrl);
+       nvme_tcp_teardown_io_queues(ctrl, false);
+       /* unquiesce to fail fast pending requests */
+       nvme_start_queues(ctrl);
+       nvme_tcp_teardown_admin_queue(ctrl, false);
+
+       if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_CONNECTING)) {
+               /* state change failure is ok if we're in DELETING state */
+               WARN_ON_ONCE(ctrl->state != NVME_CTRL_DELETING);
+               return;
+       }
+
+       nvme_tcp_reconnect_or_remove(ctrl);
+}
+
+static void nvme_tcp_teardown_ctrl(struct nvme_ctrl *ctrl, bool shutdown)
+{
+       nvme_tcp_teardown_io_queues(ctrl, shutdown);
+       if (shutdown)
+               nvme_shutdown_ctrl(ctrl);
+       else
+               nvme_disable_ctrl(ctrl, ctrl->cap);
+       nvme_tcp_teardown_admin_queue(ctrl, shutdown);
+}
+
+static void nvme_tcp_delete_ctrl(struct nvme_ctrl *ctrl)
+{
+       nvme_tcp_teardown_ctrl(ctrl, true);
+}
+
+static void nvme_reset_ctrl_work(struct work_struct *work)
+{
+       struct nvme_ctrl *ctrl =
+               container_of(work, struct nvme_ctrl, reset_work);
+
+       nvme_stop_ctrl(ctrl);
+       nvme_tcp_teardown_ctrl(ctrl, false);
+
+       if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_CONNECTING)) {
+               /* state change failure is ok if we're in DELETING state */
+               WARN_ON_ONCE(ctrl->state != NVME_CTRL_DELETING);
+               return;
+       }
+
+       if (nvme_tcp_setup_ctrl(ctrl, false))
+               goto out_fail;
+
+       return;
+
+out_fail:
+       ++ctrl->nr_reconnects;
+       nvme_tcp_reconnect_or_remove(ctrl);
+}
+
+static void nvme_tcp_stop_ctrl(struct nvme_ctrl *ctrl)
+{
+       cancel_work_sync(&ctrl->err_work);
+       cancel_delayed_work_sync(&ctrl->connect_work);
+}
+
+static void nvme_tcp_free_ctrl(struct nvme_ctrl *nctrl)
+{
+       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+
+       if (list_empty(&ctrl->list))
+               goto free_ctrl;
+
+       mutex_lock(&nvme_tcp_ctrl_mutex);
+       list_del(&ctrl->list);
+       mutex_unlock(&nvme_tcp_ctrl_mutex);
+
+       nvmf_free_options(nctrl->opts);
+free_ctrl:
+       kfree(ctrl->queues);
+       kfree(ctrl);
+}
+
+static void nvme_tcp_set_sg_null(struct nvme_command *c)
+{
+       struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
+
+       sg->addr = 0;
+       sg->length = 0;
+       sg->type = (NVME_TRANSPORT_SGL_DATA_DESC << 4) |
+                       NVME_SGL_FMT_TRANSPORT_A;
+}
+
+static void nvme_tcp_set_sg_inline(struct nvme_tcp_queue *queue,
+               struct nvme_tcp_request *req, struct nvme_command *c)
+{
+       struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
+
+       sg->addr = cpu_to_le64(queue->ctrl->ctrl.icdoff);
+       sg->length = cpu_to_le32(req->data_len);
+       sg->type = (NVME_SGL_FMT_DATA_DESC << 4) | NVME_SGL_FMT_OFFSET;
+}
+
+static void nvme_tcp_set_sg_host_data(struct nvme_tcp_request *req,
+               struct nvme_command *c)
+{
+       struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
+
+       sg->addr = 0;
+       sg->length = cpu_to_le32(req->data_len);
+       sg->type = (NVME_TRANSPORT_SGL_DATA_DESC << 4) |
+                       NVME_SGL_FMT_TRANSPORT_A;
+}
+
+static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
+{
+       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(arg);
+       struct nvme_tcp_queue *queue = &ctrl->queues[0];
+       struct nvme_tcp_cmd_pdu *pdu = ctrl->async_req.pdu;
+       struct nvme_command *cmd = &pdu->cmd;
+       u8 hdgst = nvme_tcp_hdgst_len(queue);
+
+       memset(pdu, 0, sizeof(*pdu));
+       pdu->hdr.type = nvme_tcp_cmd;
+       if (queue->hdr_digest)
+               pdu->hdr.flags |= NVME_TCP_F_HDGST;
+       pdu->hdr.hlen = sizeof(*pdu);
+       pdu->hdr.plen = cpu_to_le32(pdu->hdr.hlen + hdgst);
+
+       cmd->common.opcode = nvme_admin_async_event;
+       cmd->common.command_id = NVME_AQ_BLK_MQ_DEPTH;
+       cmd->common.flags |= NVME_CMD_SGL_METABUF;
+       nvme_tcp_set_sg_null(cmd);
+
+       ctrl->async_req.snd.state = NVME_TCP_SEND_CMD_PDU;
+       ctrl->async_req.snd.offset = 0;
+       ctrl->async_req.snd.curr_bio = NULL;
+       ctrl->async_req.rcv.curr_bio = NULL;
+       ctrl->async_req.data_len = 0;
+
+       nvme_tcp_queue_request(&ctrl->async_req);
+}
+
+static enum blk_eh_timer_return
+nvme_tcp_timeout(struct request *rq, bool reserved)
+{
+       struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+       struct nvme_tcp_ctrl *ctrl = req->queue->ctrl;
+       struct nvme_tcp_cmd_pdu *pdu = req->pdu;
+
+       dev_dbg(ctrl->ctrl.device,
+               "queue %d: timeout request %#x type %d\n",
+               nvme_tcp_queue_id(req->queue), rq->tag,
+               pdu->hdr.type);
+
+       if (ctrl->ctrl.state != NVME_CTRL_LIVE) {
+               union nvme_result res = {};
+
+               nvme_req(rq)->flags |= NVME_REQ_CANCELLED;
+               nvme_end_request(rq, NVME_SC_ABORT_REQ, res);
+               return BLK_EH_DONE;
+       }
+
+       /* queue error recovery */
+       nvme_tcp_error_recovery(&ctrl->ctrl);
+
+       return BLK_EH_RESET_TIMER;
+}
+
+static blk_status_t nvme_tcp_map_data(struct nvme_tcp_queue *queue,
+                       struct request *rq)
+{
+       struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+       struct nvme_tcp_cmd_pdu *pdu = req->pdu;
+       struct nvme_command *c = &pdu->cmd;
+
+       c->common.flags |= NVME_CMD_SGL_METABUF;
+
+       if (!req->data_len) {
+               nvme_tcp_set_sg_null(c);
+               return 0;
+       }
+
+       if (rq_data_dir(rq) == WRITE &&
+           req->data_len <= nvme_tcp_inline_data_size(queue))
+               nvme_tcp_set_sg_inline(queue, req, c);
+       else
+               nvme_tcp_set_sg_host_data(req, c);
+
+       return 0;
+}
+
+static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
+               struct request *rq)
+{
+       struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+       struct nvme_tcp_cmd_pdu *pdu = req->pdu;
+       struct nvme_tcp_queue *queue = req->queue;
+       u8 hdgst = nvme_tcp_hdgst_len(queue), ddgst = 0;
+       blk_status_t ret;
+
+       ret = nvme_setup_cmd(ns, rq, &pdu->cmd);
+       if (ret)
+               return ret;
+
+       req->snd.state = NVME_TCP_SEND_CMD_PDU;
+       req->snd.offset = 0;
+       req->snd.data_sent = 0;
+       req->pdu_len = 0;
+       req->pdu_sent = 0;
+       req->data_len = blk_rq_payload_bytes(rq);
+
+       if (rq_data_dir(rq) == WRITE) {
+               req->snd.curr_bio = rq->bio;
+               if (req->data_len <= nvme_tcp_inline_data_size(queue))
+                       req->pdu_len = req->data_len;
+       } else {
+               req->rcv.curr_bio = rq->bio;
+               if (req->rcv.curr_bio)
+                       nvme_tcp_init_recv_iter(req);
+       }
+
+       pdu->hdr.type = nvme_tcp_cmd;
+       pdu->hdr.flags = 0;
+       if (queue->hdr_digest)
+               pdu->hdr.flags |= NVME_TCP_F_HDGST;
+       if (queue->data_digest && req->pdu_len) {
+               pdu->hdr.flags |= NVME_TCP_F_DDGST;
+               ddgst = nvme_tcp_ddgst_len(queue);
+       }
+       pdu->hdr.hlen = sizeof(*pdu);
+       pdu->hdr.pdo = req->pdu_len ? pdu->hdr.hlen + hdgst : 0;
+       pdu->hdr.plen =
+               cpu_to_le32(pdu->hdr.hlen + hdgst + req->pdu_len + ddgst);
+
+       ret = nvme_tcp_map_data(queue, rq);
+       if (unlikely(ret)) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "Failed to map data (%d)\n", ret);
+               return ret;
+       }
+
+       return 0;
+}
+
+static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
+               const struct blk_mq_queue_data *bd)
+{
+       struct nvme_ns *ns = hctx->queue->queuedata;
+       struct nvme_tcp_queue *queue = hctx->driver_data;
+       struct request *rq = bd->rq;
+       struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+       bool queue_ready = test_bit(NVME_TCP_Q_LIVE, &queue->flags);
+       blk_status_t ret;
+
+       if (!nvmf_check_ready(&queue->ctrl->ctrl, rq, queue_ready))
+               return nvmf_fail_nonready_command(&queue->ctrl->ctrl, rq);
+
+       ret = nvme_tcp_setup_cmd_pdu(ns, rq);
+       if (unlikely(ret))
+               return ret;
+
+       blk_mq_start_request(rq);
+
+       nvme_tcp_queue_request(req);
+
+       return BLK_STS_OK;
+}
+
+static struct blk_mq_ops nvme_tcp_mq_ops = {
+       .queue_rq       = nvme_tcp_queue_rq,
+       .complete       = nvme_complete_rq,
+       .init_request   = nvme_tcp_init_request,
+       .exit_request   = nvme_tcp_exit_request,
+       .init_hctx      = nvme_tcp_init_hctx,
+       .timeout        = nvme_tcp_timeout,
+};
+
+static struct blk_mq_ops nvme_tcp_admin_mq_ops = {
+       .queue_rq       = nvme_tcp_queue_rq,
+       .complete       = nvme_complete_rq,
+       .init_request   = nvme_tcp_init_request,
+       .exit_request   = nvme_tcp_exit_request,
+       .init_hctx      = nvme_tcp_init_admin_hctx,
+       .timeout        = nvme_tcp_timeout,
+};
+
+static const struct nvme_ctrl_ops nvme_tcp_ctrl_ops = {
+       .name                   = "tcp",
+       .module                 = THIS_MODULE,
+       .flags                  = NVME_F_FABRICS,
+       .reg_read32             = nvmf_reg_read32,
+       .reg_read64             = nvmf_reg_read64,
+       .reg_write32            = nvmf_reg_write32,
+       .free_ctrl              = nvme_tcp_free_ctrl,
+       .submit_async_event     = nvme_tcp_submit_async_event,
+       .delete_ctrl            = nvme_tcp_delete_ctrl,
+       .get_address            = nvmf_get_address,
+       .stop_ctrl              = nvme_tcp_stop_ctrl,
+};
+
+static bool
+nvme_tcp_existing_controller(struct nvmf_ctrl_options *opts)
+{
+       struct nvme_tcp_ctrl *ctrl;
+       bool found = false;
+
+       mutex_lock(&nvme_tcp_ctrl_mutex);
+       list_for_each_entry(ctrl, &nvme_tcp_ctrl_list, list) {
+               found = nvmf_ip_options_match(&ctrl->ctrl, opts);
+               if (found)
+                       break;
+       }
+       mutex_unlock(&nvme_tcp_ctrl_mutex);
+
+       return found;
+}
+
+static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
+               struct nvmf_ctrl_options *opts)
+{
+       struct nvme_tcp_ctrl *ctrl;
+       int ret;
+
+       ctrl = kzalloc(sizeof(*ctrl), GFP_KERNEL);
+       if (!ctrl)
+               return ERR_PTR(-ENOMEM);
+
+       INIT_LIST_HEAD(&ctrl->list);
+       ctrl->ctrl.opts = opts;
+       ctrl->ctrl.queue_count = opts->nr_io_queues + 1; /* +1 for admin queue */
+       ctrl->ctrl.sqsize = opts->queue_size - 1;
+       ctrl->ctrl.kato = opts->kato;
+
+       INIT_DELAYED_WORK(&ctrl->ctrl.connect_work,
+                       nvme_tcp_reconnect_ctrl_work);
+       INIT_WORK(&ctrl->ctrl.err_work, nvme_tcp_error_recovery_work);
+       INIT_WORK(&ctrl->ctrl.reset_work, nvme_reset_ctrl_work);
+
+       if (!(opts->mask & NVMF_OPT_TRSVCID)) {
+               opts->trsvcid =
+                       kstrdup(__stringify(NVME_TCP_DISC_PORT), GFP_KERNEL);
+               if (!opts->trsvcid) {
+                       ret = -ENOMEM;
+                       goto out_free_ctrl;
+               }
+               opts->mask |= NVMF_OPT_TRSVCID;
+       }
+
+       ret = inet_pton_with_scope(&init_net, AF_UNSPEC,
+                       opts->traddr, opts->trsvcid, &ctrl->addr);
+       if (ret) {
+               pr_err("malformed address passed: %s:%s\n",
+                       opts->traddr, opts->trsvcid);
+               goto out_free_ctrl;
+       }
+
+       if (opts->mask & NVMF_OPT_HOST_TRADDR) {
+               ret = inet_pton_with_scope(&init_net, AF_UNSPEC,
+                       opts->host_traddr, NULL, &ctrl->src_addr);
+               if (ret) {
+                       pr_err("malformed src address passed: %s\n",
+                              opts->host_traddr);
+                       goto out_free_ctrl;
+               }
+       }
+
+       if (!opts->duplicate_connect && nvme_tcp_existing_controller(opts)) {
+               ret = -EALREADY;
+               goto out_free_ctrl;
+       }
+
+       ctrl->queues = kcalloc(opts->nr_io_queues + 1, sizeof(*ctrl->queues),
+                               GFP_KERNEL);
+       if (!ctrl->queues) {
+               ret = -ENOMEM;
+               goto out_free_ctrl;
+       }
+
+       ret = nvme_init_ctrl(&ctrl->ctrl, dev, &nvme_tcp_ctrl_ops, 0);
+       if (ret)
+               goto out_kfree_queues;
+
+       if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
+               WARN_ON_ONCE(1);
+               ret = -EINTR;
+               goto out_uninit_ctrl;
+       }
+
+       ret = nvme_tcp_setup_ctrl(&ctrl->ctrl, true);
+       if (ret)
+               goto out_uninit_ctrl;
+
+       dev_info(ctrl->ctrl.device, "new ctrl: NQN \"%s\", addr %pISp\n",
+               ctrl->ctrl.opts->subsysnqn, &ctrl->addr);
+
+       nvme_get_ctrl(&ctrl->ctrl);
+
+       mutex_lock(&nvme_tcp_ctrl_mutex);
+       list_add_tail(&ctrl->list, &nvme_tcp_ctrl_list);
+       mutex_unlock(&nvme_tcp_ctrl_mutex);
+
+       return &ctrl->ctrl;
+
+out_uninit_ctrl:
+       nvme_uninit_ctrl(&ctrl->ctrl);
+       nvme_put_ctrl(&ctrl->ctrl);
+       if (ret > 0)
+               ret = -EIO;
+       return ERR_PTR(ret);
+out_kfree_queues:
+       kfree(ctrl->queues);
+out_free_ctrl:
+       kfree(ctrl);
+       return ERR_PTR(ret);
+}
+
+static struct nvmf_transport_ops nvme_tcp_transport = {
+       .name           = "tcp",
+       .module         = THIS_MODULE,
+       .required_opts  = NVMF_OPT_TRADDR,
+       .allowed_opts   = NVMF_OPT_TRSVCID | NVMF_OPT_RECONNECT_DELAY |
+                         NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO |
+                         NVMF_OPT_HDR_DIGEST | NVMF_OPT_DATA_DIGEST,
+       .create_ctrl    = nvme_tcp_create_ctrl,
+};
+
+static int __init nvme_tcp_init_module(void)
+{
+       nvme_tcp_wq = alloc_workqueue("nvme_tcp_wq",
+                       WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
+       if (!nvme_tcp_wq)
+               return -ENOMEM;
+
+       nvmf_register_transport(&nvme_tcp_transport);
+       return 0;
+}
+
+static void __exit nvme_tcp_cleanup_module(void)
+{
+       struct nvme_tcp_ctrl *ctrl;
+
+       nvmf_unregister_transport(&nvme_tcp_transport);
+
+       mutex_lock(&nvme_tcp_ctrl_mutex);
+       list_for_each_entry(ctrl, &nvme_tcp_ctrl_list, list)
+               nvme_delete_ctrl(&ctrl->ctrl);
+       mutex_unlock(&nvme_tcp_ctrl_mutex);
+       flush_workqueue(nvme_delete_wq);
+
+       destroy_workqueue(nvme_tcp_wq);
+}
+
+module_init(nvme_tcp_init_module);
+module_exit(nvme_tcp_cleanup_module);
+
+MODULE_LICENSE("GPL v2");

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v2 14/14] nvme-tcp: add NVMe over TCP host driver
@ 2018-11-20 23:34     ` Narayan Ayalasomayajula
  0 siblings, 0 replies; 76+ messages in thread
From: Narayan Ayalasomayajula @ 2018-11-20 23:34 UTC (permalink / raw)


Hi Sagi,

>+       icreq->pfv = cpu_to_le16(NVME_TCP_PFV_1_0);
>+       icreq->maxr2t = cpu_to_le16(1); /* single inflight r2t supported */
>+       icreq->hpda = 0; /* no alignment constraint */

The NVMe-TCP spec indicates that MAXR2T is a 0's-based value. To support a single inflight R2T as indicated in the comment above, icreq->maxr2t should be set to 0, right? 
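
For reference, a 0's-based setting would presumably look like this (just a
sketch of the suggested change, not something I have tested):

        icreq->maxr2t = cpu_to_le16(0); /* 0's based: one outstanding R2T */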

Thanks,
Narayan

-----Original Message-----
From: Linux-nvme <linux-nvme-bounces@lists.infradead.org> On Behalf Of Sagi Grimberg
Sent: Monday, November 19, 2018 7:00 PM
To: linux-nvme at lists.infradead.org
Cc: linux-block at vger.kernel.org; netdev at vger.kernel.org; Keith Busch <keith.busch at intel.com>; David S. Miller <davem at davemloft.net>; Christoph Hellwig <hch at lst.de>
Subject: [PATCH v2 14/14] nvme-tcp: add NVMe over TCP host driver

From: Sagi Grimberg <sagi@lightbitslabs.com>

This patch implements the NVMe over TCP host driver. It can be used to
connect to remote NVMe over Fabrics subsystems over good old TCP/IP.

The driver implements TP 8000, which defines how NVMe over Fabrics
capsules and data are encapsulated in NVMe/TCP PDUs and exchanged on top
of a TCP byte stream. NVMe/TCP header and data digests are supported as
well.

To connect to all NVMe over Fabrics controllers reachable on a given target
port over TCP, use the following command:

        nvme connect-all -t tcp -a $IPADDR

This requires the latest version of nvme-cli with TCP support.
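
To connect to a single subsystem instead, something along these lines
should work (the subsystem NQN is a placeholder; 4420 is the usual
NVMe over Fabrics port):

        nvme connect -t tcp -a $IPADDR -s 4420 -n <subsystem NQN>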

Signed-off-by: Sagi Grimberg <sagi at lightbitslabs.com>
Signed-off-by: Roy Shterman <roys at lightbitslabs.com>
Signed-off-by: Solganik Alexander <sashas at lightbitslabs.com>
---
 drivers/nvme/host/Kconfig  |   15 +
 drivers/nvme/host/Makefile |    3 +
 drivers/nvme/host/tcp.c    | 2306 ++++++++++++++++++++++++++++++++++++
 3 files changed, 2324 insertions(+)
 create mode 100644 drivers/nvme/host/tcp.c

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index 88a8b5916624..0f345e207675 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -57,3 +57,18 @@ config NVME_FC
          from https://github.com/linux-nvme/nvme-cli.

          If unsure, say N.
+
+config NVME_TCP
+       tristate "NVM Express over Fabrics TCP host driver"
+       depends on INET
+       depends on BLK_DEV_NVME
+       select NVME_FABRICS
+       help
+         This provides support for the NVMe over Fabrics protocol using
+         the TCP transport.  This allows you to use remote block devices
+         exported using the NVMe protocol set.
+
+         To configure a NVMe over Fabrics controller use the nvme-cli tool
+         from https://github.com/linux-nvme/nvme-cli.
+
+         If unsure, say N.
diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
index aea459c65ae1..8a4b671c5f0c 100644
--- a/drivers/nvme/host/Makefile
+++ b/drivers/nvme/host/Makefile
@@ -7,6 +7,7 @@ obj-$(CONFIG_BLK_DEV_NVME)              += nvme.o
 obj-$(CONFIG_NVME_FABRICS)             += nvme-fabrics.o
 obj-$(CONFIG_NVME_RDMA)                        += nvme-rdma.o
 obj-$(CONFIG_NVME_FC)                  += nvme-fc.o
+obj-$(CONFIG_NVME_TCP)                 += nvme-tcp.o

 nvme-core-y                            := core.o
 nvme-core-$(CONFIG_TRACING)            += trace.o
@@ -21,3 +22,5 @@ nvme-fabrics-y                                += fabrics.o
 nvme-rdma-y                            += rdma.o

 nvme-fc-y                              += fc.o
+
+nvme-tcp-y                             += tcp.o
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
new file mode 100644
index 000000000000..4c583859f0ad
--- /dev/null
+++ b/drivers/nvme/host/tcp.c
@@ -0,0 +1,2306 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * NVMe over Fabrics TCP host.
+ * Copyright (c) 2018 LightBits Labs. All rights reserved.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/nvme-tcp.h>
+#include <net/sock.h>
+#include <net/tcp.h>
+#include <linux/blk-mq.h>
+#include <crypto/hash.h>
+
+#include "nvme.h"
+#include "fabrics.h"
+
+struct nvme_tcp_queue;
+
+enum nvme_tcp_send_state {
+       NVME_TCP_SEND_CMD_PDU = 0,
+       NVME_TCP_SEND_H2C_PDU,
+       NVME_TCP_SEND_DATA,
+       NVME_TCP_SEND_DDGST,
+};
+
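+/* per-request send-side state: current bio, iov_iter, offset and send state */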
+struct nvme_tcp_send_ctx {
+       struct bio              *curr_bio;
+       struct iov_iter         iter;
+       size_t                  offset;
+       size_t                  data_sent;
+       enum nvme_tcp_send_state state;
+};
+
+struct nvme_tcp_recv_ctx {
+       struct iov_iter         iter;
+       struct bio              *curr_bio;
+};
+
+struct nvme_tcp_request {
+       struct nvme_request     req;
+       void                    *pdu;
+       struct nvme_tcp_queue   *queue;
+       u32                     data_len;
+       u32                     pdu_len;
+       u32                     pdu_sent;
+       u16                     ttag;
+       struct list_head        entry;
+       struct nvme_tcp_recv_ctx rcv;
+       struct nvme_tcp_send_ctx snd;
+       u32                     ddgst;
+};
+
+enum nvme_tcp_queue_flags {
+       NVME_TCP_Q_ALLOCATED    = 0,
+       NVME_TCP_Q_LIVE         = 1,
+};
+
+enum nvme_tcp_recv_state {
+       NVME_TCP_RECV_PDU = 0,
+       NVME_TCP_RECV_DATA,
+       NVME_TCP_RECV_DDGST,
+};
+
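+/* per-queue receive state: remaining PDU, data and digest byte counts */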
+struct nvme_tcp_queue_recv_ctx {
+       char            *pdu;
+       int             pdu_remaining;
+       int             pdu_offset;
+       size_t          data_remaining;
+       size_t          ddgst_remaining;
+};
+
+struct nvme_tcp_ctrl;
+struct nvme_tcp_queue {
+       struct socket           *sock;
+       struct work_struct      io_work;
+       int                     io_cpu;
+
+       spinlock_t              lock;
+       struct list_head        send_list;
+
+       int                     queue_size;
+       size_t                  cmnd_capsule_len;
+       struct nvme_tcp_ctrl    *ctrl;
+       unsigned long           flags;
+       bool                    rd_enabled;
+
+       struct nvme_tcp_queue_recv_ctx rcv;
+       struct nvme_tcp_request *request;
+
+       bool                    hdr_digest;
+       bool                    data_digest;
+       struct ahash_request    *rcv_hash;
+       struct ahash_request    *snd_hash;
+       __le32                  exp_ddgst;
+       __le32                  recv_ddgst;
+
+       struct page_frag_cache  pf_cache;
+
+       void (*sc)(struct sock *);
+       void (*dr)(struct sock *);
+       void (*ws)(struct sock *);
+};
+
+struct nvme_tcp_ctrl {
+       /* read only in the hot path */
+       struct nvme_tcp_queue   *queues;
+       struct blk_mq_tag_set   tag_set;
+
+       /* other member variables */
+       struct list_head        list;
+       struct blk_mq_tag_set   admin_tag_set;
+       struct sockaddr_storage addr;
+       struct sockaddr_storage src_addr;
+       struct nvme_ctrl        ctrl;
+
+       struct nvme_tcp_request async_req;
+};
+
+static LIST_HEAD(nvme_tcp_ctrl_list);
+static DEFINE_MUTEX(nvme_tcp_ctrl_mutex);
+static struct workqueue_struct *nvme_tcp_wq;
+static struct blk_mq_ops nvme_tcp_mq_ops;
+static struct blk_mq_ops nvme_tcp_admin_mq_ops;
+
+static inline struct nvme_tcp_ctrl *to_tcp_ctrl(struct nvme_ctrl *ctrl)
+{
+       return container_of(ctrl, struct nvme_tcp_ctrl, ctrl);
+}
+
+static inline int nvme_tcp_queue_id(struct nvme_tcp_queue *queue)
+{
+       return queue - queue->ctrl->queues;
+}
+
+static inline struct blk_mq_tags *nvme_tcp_tagset(struct nvme_tcp_queue *queue)
+{
+       u32 queue_idx = nvme_tcp_queue_id(queue);
+
+       if (queue_idx == 0)
+               return queue->ctrl->admin_tag_set.tags[queue_idx];
+       return queue->ctrl->tag_set.tags[queue_idx - 1];
+}
+
+static inline u8 nvme_tcp_hdgst_len(struct nvme_tcp_queue *queue)
+{
+       return queue->hdr_digest ? NVME_TCP_DIGEST_LENGTH : 0;
+}
+
+static inline u8 nvme_tcp_ddgst_len(struct nvme_tcp_queue *queue)
+{
+       return queue->data_digest ? NVME_TCP_DIGEST_LENGTH : 0;
+}
+
+static inline size_t nvme_tcp_inline_data_size(struct nvme_tcp_queue *queue)
+{
+       return queue->cmnd_capsule_len - sizeof(struct nvme_command);
+}
+
+static inline bool nvme_tcp_async_req(struct nvme_tcp_request *req)
+{
+       return req == &req->queue->ctrl->async_req;
+}
+
+static inline bool nvme_tcp_has_inline_data(struct nvme_tcp_request *req)
+{
+       struct request *rq;
+       unsigned int bytes;
+
+       if (unlikely(nvme_tcp_async_req(req)))
+               return false; /* async events don't have a request */
+
+       rq = blk_mq_rq_from_pdu(req);
+       bytes = blk_rq_payload_bytes(rq);
+
+       return rq_data_dir(rq) == WRITE && bytes &&
+               bytes <= nvme_tcp_inline_data_size(req->queue);
+}
+
+static inline struct page *nvme_tcp_req_cur_page(struct nvme_tcp_request *req)
+{
+       return req->snd.iter.bvec->bv_page;
+}
+
+static inline size_t nvme_tcp_req_cur_offset(struct nvme_tcp_request *req)
+{
+       return req->snd.iter.bvec->bv_offset + req->snd.iter.iov_offset;
+}
+
+static inline size_t nvme_tcp_req_cur_length(struct nvme_tcp_request *req)
+{
+       return min_t(size_t, req->snd.iter.bvec->bv_len - req->snd.iter.iov_offset,
+                       req->pdu_len - req->pdu_sent);
+}
+
+static inline size_t nvme_tcp_req_offset(struct nvme_tcp_request *req)
+{
+       return req->snd.iter.iov_offset;
+}
+
+static inline size_t nvme_tcp_pdu_data_left(struct nvme_tcp_request *req)
+{
+       return rq_data_dir(blk_mq_rq_from_pdu(req)) == WRITE ?
+                       req->pdu_len - req->pdu_sent : 0;
+}
+
+static inline size_t nvme_tcp_pdu_last_send(struct nvme_tcp_request *req,
+               int len)
+{
+       return nvme_tcp_pdu_data_left(req) <= len;
+}
+
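+/* point the send iov_iter at the request's bio vectors or special payload */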
+static void nvme_tcp_init_send_iter(struct nvme_tcp_request *req)
+{
+       struct request *rq = blk_mq_rq_from_pdu(req);
+       struct bio_vec *vec;
+       unsigned int size;
+       int nsegs;
+       size_t offset;
+
+       if (rq->rq_flags & RQF_SPECIAL_PAYLOAD) {
+               vec = &rq->special_vec;
+               nsegs = 1;
+               size = blk_rq_payload_bytes(rq);
+               offset = 0;
+       } else {
+               struct bio *bio = req->snd.curr_bio;
+
+               vec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+               nsegs = bio_segments(bio);
+               size = bio->bi_iter.bi_size;
+               offset = bio->bi_iter.bi_bvec_done;
+       }
+
+       iov_iter_bvec(&req->snd.iter, WRITE, vec, nsegs, size);
+       req->snd.iter.iov_offset = offset;
+}
+
+static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
+               int len)
+{
+       req->snd.data_sent += len;
+       req->pdu_sent += len;
+       iov_iter_advance(&req->snd.iter, len);
+       if (!iov_iter_count(&req->snd.iter) &&
+           req->snd.data_sent < req->data_len) {
+               req->snd.curr_bio = req->snd.curr_bio->bi_next;
+               nvme_tcp_init_send_iter(req);
+       }
+}
+
+static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req)
+{
+       struct nvme_tcp_queue *queue = req->queue;
+
+       spin_lock_bh(&queue->lock);
+       list_add_tail(&req->entry, &queue->send_list);
+       spin_unlock_bh(&queue->lock);
+
+       queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+}
+
+static inline struct nvme_tcp_request *
+nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
+{
+       struct nvme_tcp_request *req;
+
+       spin_lock_bh(&queue->lock);
+       req = list_first_entry_or_null(&queue->send_list,
+                       struct nvme_tcp_request, entry);
+       if (req)
+               list_del(&req->entry);
+       spin_unlock_bh(&queue->lock);
+
+       return req;
+}
+
+static inline void nvme_tcp_ddgst_final(struct ahash_request *hash, u32 *dgst)
+{
+       ahash_request_set_crypt(hash, NULL, (u8 *)dgst, 0);
+       crypto_ahash_final(hash);
+}
+
+static inline void nvme_tcp_ddgst_update(struct ahash_request *hash,
+               struct page *page, off_t off, size_t len)
+{
+       struct scatterlist sg;
+
+       sg_init_marker(&sg, 1);
+       sg_set_page(&sg, page, len, off);
+       ahash_request_set_crypt(hash, &sg, NULL, len);
+       crypto_ahash_update(hash);
+}
+
+static inline void nvme_tcp_hdgst(struct ahash_request *hash,
+               void *pdu, size_t len)
+{
+       struct scatterlist sg;
+
+       sg_init_one(&sg, pdu, len);
+       ahash_request_set_crypt(hash, &sg, pdu + len, len);
+       crypto_ahash_digest(hash);
+}
+
+static int nvme_tcp_verify_hdgst(struct nvme_tcp_queue *queue,
+       void *pdu, size_t pdu_len)
+{
+       struct nvme_tcp_hdr *hdr = pdu;
+       __le32 recv_digest;
+       __le32 exp_digest;
+
+       if (unlikely(!(hdr->flags & NVME_TCP_F_HDGST))) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "queue %d: header digest flag is cleared\n",
+                       nvme_tcp_queue_id(queue));
+               return -EPROTO;
+       }
+
+       recv_digest = *(__le32 *)(pdu + hdr->hlen);
+       nvme_tcp_hdgst(queue->rcv_hash, pdu, pdu_len);
+       exp_digest = *(__le32 *)(pdu + hdr->hlen);
+       if (recv_digest != exp_digest) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "header digest error: recv %#x expected %#x\n",
+                       le32_to_cpu(recv_digest), le32_to_cpu(exp_digest));
+               return -EIO;
+       }
+
+       return 0;
+}
+
+static int nvme_tcp_check_ddgst(struct nvme_tcp_queue *queue, void *pdu)
+{
+       struct nvme_tcp_hdr *hdr = pdu;
+       u32 len;
+
+       len = le32_to_cpu(hdr->plen) - hdr->hlen -
+               ((hdr->flags & NVME_TCP_F_HDGST) ? nvme_tcp_hdgst_len(queue) : 0);
+
+       if (unlikely(len && !(hdr->flags & NVME_TCP_F_DDGST))) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "queue %d: data digest flag is cleared\n",
+                       nvme_tcp_queue_id(queue));
+               return -EPROTO;
+       }
+       crypto_ahash_init(queue->rcv_hash);
+
+       return 0;
+}
+
+static void nvme_tcp_exit_request(struct blk_mq_tag_set *set,
+               struct request *rq, unsigned int hctx_idx)
+{
+       struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+
+       page_frag_free(req->pdu);
+}
+
+static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
+               struct request *rq, unsigned int hctx_idx,
+               unsigned int numa_node)
+{
+       struct nvme_tcp_ctrl *ctrl = set->driver_data;
+       struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+       int queue_idx = (set == &ctrl->tag_set) ? hctx_idx + 1 : 0;
+       struct nvme_tcp_queue *queue = &ctrl->queues[queue_idx];
+       u8 hdgst = nvme_tcp_hdgst_len(queue);
+
+       req->pdu = page_frag_alloc(&queue->pf_cache,
+               sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
+               GFP_KERNEL | __GFP_ZERO);
+       if (!req->pdu)
+               return -ENOMEM;
+
+       req->queue = queue;
+       nvme_req(rq)->ctrl = &ctrl->ctrl;
+
+       return 0;
+}
+
+static int nvme_tcp_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+               unsigned int hctx_idx)
+{
+       struct nvme_tcp_ctrl *ctrl = data;
+       struct nvme_tcp_queue *queue = &ctrl->queues[hctx_idx + 1];
+
+       BUG_ON(hctx_idx >= ctrl->ctrl.queue_count);
+
+       hctx->driver_data = queue;
+       return 0;
+}
+
+static int nvme_tcp_init_admin_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+               unsigned int hctx_idx)
+{
+       struct nvme_tcp_ctrl *ctrl = data;
+       struct nvme_tcp_queue *queue = &ctrl->queues[0];
+
+       BUG_ON(hctx_idx != 0);
+
+       hctx->driver_data = queue;
+       return 0;
+}
+
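+/* derive the receive state from what is still outstanding for this PDU */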
+static enum nvme_tcp_recv_state nvme_tcp_recv_state(struct nvme_tcp_queue *queue)
+{
+       return  (queue->rcv.pdu_remaining) ? NVME_TCP_RECV_PDU :
+               (queue->rcv.ddgst_remaining) ? NVME_TCP_RECV_DDGST :
+               NVME_TCP_RECV_DATA;
+}
+
+static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
+{
+       struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+
+       rcv->pdu_remaining = sizeof(struct nvme_tcp_rsp_pdu) +
+                               nvme_tcp_hdgst_len(queue);
+       rcv->pdu_offset = 0;
+       rcv->data_remaining = -1;
+       rcv->ddgst_remaining = 0;
+}
+
+void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
+{
+       if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+               return;
+
+       queue_work(nvme_wq, &ctrl->err_work);
+}
+
+static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
+               struct nvme_completion *cqe)
+{
+       struct request *rq;
+       struct nvme_tcp_request *req;
+
+       rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
+       if (!rq) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "queue %d tag 0x%x not found\n",
+                       nvme_tcp_queue_id(queue), cqe->command_id);
+               nvme_tcp_error_recovery(&queue->ctrl->ctrl);
+               return -EINVAL;
+       }
+       req = blk_mq_rq_to_pdu(rq);
+
+       nvme_end_request(rq, cqe->status, cqe->result);
+
+       return 0;
+}
+
+static int nvme_tcp_handle_c2h_data(struct nvme_tcp_queue *queue,
+               struct nvme_tcp_data_pdu *pdu)
+{
+       struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+       struct nvme_tcp_request *req;
+       struct request *rq;
+
+       rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
+       if (!rq) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "queue %d tag %#x not found\n",
+                       nvme_tcp_queue_id(queue), pdu->command_id);
+               return -ENOENT;
+       }
+       req = blk_mq_rq_to_pdu(rq);
+
+       if (!blk_rq_payload_bytes(rq)) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "queue %d tag %#x unexpected data\n",
+                       nvme_tcp_queue_id(queue), rq->tag);
+               return -EIO;
+       }
+
+       rcv->data_remaining = le32_to_cpu(pdu->data_length);
+       /* No support for out-of-order */
+       WARN_ON(le32_to_cpu(pdu->data_offset));
+
+       return 0;
+}
+
+static int nvme_tcp_handle_comp(struct nvme_tcp_queue *queue,
+               struct nvme_tcp_rsp_pdu *pdu)
+{
+       struct nvme_completion *cqe = &pdu->cqe;
+       int ret = 0;
+
+       /*
+        * AEN requests are special as they don't time out and can
+        * survive any kind of queue freeze and often don't respond to
+        * aborts.  We don't even bother to allocate a struct request
+        * for them but rather special case them here.
+        */
+       if (unlikely(nvme_tcp_queue_id(queue) == 0 &&
+           cqe->command_id >= NVME_AQ_BLK_MQ_DEPTH))
+               nvme_complete_async_event(&queue->ctrl->ctrl, cqe->status,
+                               &cqe->result);
+       else
+               ret = nvme_tcp_process_nvme_cqe(queue, cqe);
+
+       return ret;
+}
+
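+/*
+ * A controller R2T solicits host payload; validate the requested window
+ * against what was already sent and build the matching H2CData header in
+ * the request's PDU buffer.
+ */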
+static int nvme_tcp_setup_h2c_data_pdu(struct nvme_tcp_request *req,
+               struct nvme_tcp_r2t_pdu *pdu)
+{
+       struct nvme_tcp_data_pdu *data = req->pdu;
+       struct nvme_tcp_queue *queue = req->queue;
+       struct request *rq = blk_mq_rq_from_pdu(req);
+       u8 hdgst = nvme_tcp_hdgst_len(queue);
+       u8 ddgst = nvme_tcp_ddgst_len(queue);
+
+       req->pdu_len = le32_to_cpu(pdu->r2t_length);
+       req->pdu_sent = 0;
+
+       if (unlikely(req->snd.data_sent + req->pdu_len > req->data_len)) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "req %d r2t length %u exceeded data length %u (%zu sent)\n",
+                       rq->tag, req->pdu_len, req->data_len,
+                       req->snd.data_sent);
+               return -EPROTO;
+       }
+
+       if (unlikely(le32_to_cpu(pdu->r2t_offset) < req->snd.data_sent)) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "req %d unexpected r2t offset %u (expected %zu)\n",
+                       rq->tag, le32_to_cpu(pdu->r2t_offset),
+                       req->snd.data_sent);
+               return -EPROTO;
+       }
+
+       memset(data, 0, sizeof(*data));
+       data->hdr.type = nvme_tcp_h2c_data;
+       data->hdr.flags = NVME_TCP_F_DATA_LAST;
+       if (queue->hdr_digest)
+               data->hdr.flags |= NVME_TCP_F_HDGST;
+       if (queue->data_digest)
+               data->hdr.flags |= NVME_TCP_F_DDGST;
+       data->hdr.hlen = sizeof(*data);
+       data->hdr.pdo = data->hdr.hlen + hdgst;
+       data->hdr.plen =
+               cpu_to_le32(data->hdr.hlen + hdgst + req->pdu_len + ddgst);
+       data->ttag = pdu->ttag;
+       data->command_id = rq->tag;
+       data->data_offset = cpu_to_le32(req->snd.data_sent);
+       data->data_length = cpu_to_le32(req->pdu_len);
+       return 0;
+}
+
+static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
+               struct nvme_tcp_r2t_pdu *pdu)
+{
+       struct nvme_tcp_request *req;
+       struct request *rq;
+       int ret;
+
+       rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
+       if (!rq) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "queue %d tag %#x not found\n",
+                       nvme_tcp_queue_id(queue), pdu->command_id);
+               return -ENOENT;
+       }
+       req = blk_mq_rq_to_pdu(rq);
+
+       ret = nvme_tcp_setup_h2c_data_pdu(req, pdu);
+       if (unlikely(ret))
+               return ret;
+
+       req->snd.state = NVME_TCP_SEND_H2C_PDU;
+       req->snd.offset = 0;
+
+       nvme_tcp_queue_request(req);
+
+       return 0;
+}
+
+static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
+               unsigned int *offset, size_t *len)
+{
+       struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+       struct nvme_tcp_hdr *hdr;
+       size_t rcv_len = min_t(size_t, *len, rcv->pdu_remaining);
+       int ret;
+
+       ret = skb_copy_bits(skb, *offset, &rcv->pdu[rcv->pdu_offset], rcv_len);
+       if (unlikely(ret))
+               return ret;
+
+       rcv->pdu_remaining -= rcv_len;
+       rcv->pdu_offset += rcv_len;
+       *offset += rcv_len;
+       *len -= rcv_len;
+       if (queue->rcv.pdu_remaining)
+               return 0;
+
+       hdr = (void *)rcv->pdu;
+       if (queue->hdr_digest) {
+               ret = nvme_tcp_verify_hdgst(queue, rcv->pdu, hdr->hlen);
+               if (unlikely(ret))
+                       return ret;
+       }
+
+       if (queue->data_digest) {
+               ret = nvme_tcp_check_ddgst(queue, rcv->pdu);
+               if (unlikely(ret))
+                       return ret;
+       }
+
+       switch (hdr->type) {
+       case nvme_tcp_c2h_data:
+               ret = nvme_tcp_handle_c2h_data(queue, (void *)rcv->pdu);
+               break;
+       case nvme_tcp_rsp:
+               nvme_tcp_init_recv_ctx(queue);
+               ret = nvme_tcp_handle_comp(queue, (void *)rcv->pdu);
+               break;
+       case nvme_tcp_r2t:
+               nvme_tcp_init_recv_ctx(queue);
+               ret = nvme_tcp_handle_r2t(queue, (void *)rcv->pdu);
+               break;
+       default:
+               dev_err(queue->ctrl->ctrl.device, "unsupported pdu type (%d)\n",
+                       hdr->type);
+               return -EINVAL;
+       }
+
+       return ret;
+}
+
+static void nvme_tcp_init_recv_iter(struct nvme_tcp_request *req)
+{
+       struct bio *bio = req->rcv.curr_bio;
+       struct bio_vec *vec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+       unsigned int nsegs = bio_segments(bio);
+
+       iov_iter_bvec(&req->rcv.iter, READ, vec, nsegs,
+               bio->bi_iter.bi_size);
+       req->rcv.iter.iov_offset = bio->bi_iter.bi_bvec_done;
+}
+
+static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
+                             unsigned int *offset, size_t *len)
+{
+       struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+       struct nvme_tcp_data_pdu *pdu = (void *)rcv->pdu;
+       struct nvme_tcp_request *req;
+       struct request *rq;
+
+       rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
+       if (!rq) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "queue %d tag %#x not found\n",
+                       nvme_tcp_queue_id(queue), pdu->command_id);
+               return -ENOENT;
+       }
+       req = blk_mq_rq_to_pdu(rq);
+
+       while (true) {
+               int recv_len, ret;
+
+               recv_len = min_t(size_t, *len, rcv->data_remaining);
+               if (!recv_len)
+                       break;
+
+               /*
+                * FIXME: This assumes that data comes in-order,
+                *  need to handle the out-of-order case.
+                */
+               if (!iov_iter_count(&req->rcv.iter)) {
+                       req->rcv.curr_bio = req->rcv.curr_bio->bi_next;
+
+                       /*
+                        * If we don't have any bios left it means that the
+                        * controller sent more data than we requested,
+                        * hence error
+                        */
+                       if (!req->rcv.curr_bio) {
+                               dev_err(queue->ctrl->ctrl.device,
+                                       "queue %d no space in request %#x",
+                                       nvme_tcp_queue_id(queue), rq->tag);
+                               nvme_tcp_init_recv_ctx(queue);
+                               return -EIO;
+                       }
+                       nvme_tcp_init_recv_iter(req);
+               }
+
+               /* we can read only from what is left in this bio */
+               recv_len = min_t(size_t, recv_len,
+                               iov_iter_count(&req->rcv.iter));
+
+               if (queue->data_digest)
+                       ret = skb_copy_and_hash_datagram_iter(skb, *offset,
+                               &req->rcv.iter, recv_len, queue->rcv_hash);
+               else
+                       ret = skb_copy_datagram_iter(skb, *offset,
+                                       &req->rcv.iter, recv_len);
+               if (ret) {
+                       dev_err(queue->ctrl->ctrl.device,
+                               "queue %d failed to copy request %#x data",
+                               nvme_tcp_queue_id(queue), rq->tag);
+                       return ret;
+               }
+
+               *len -= recv_len;
+               *offset += recv_len;
+               rcv->data_remaining -= recv_len;
+       }
+
+       if (!rcv->data_remaining) {
+               if (queue->data_digest) {
+                       nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
+                       rcv->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
+               } else {
+                       nvme_tcp_init_recv_ctx(queue);
+               }
+       }
+
+       return 0;
+}
+
+static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
+               struct sk_buff *skb, unsigned int *offset, size_t *len)
+{
+       struct nvme_tcp_queue_recv_ctx *rcv = &queue->rcv;
+       char *ddgst = (char *)&queue->recv_ddgst;
+       size_t recv_len = min_t(size_t, *len, rcv->ddgst_remaining);
+       off_t off = NVME_TCP_DIGEST_LENGTH - rcv->ddgst_remaining;
+       int ret;
+
+       ret = skb_copy_bits(skb, *offset, &ddgst[off], recv_len);
+       if (unlikely(ret))
+               return ret;
+
+       rcv->ddgst_remaining -= recv_len;
+       *offset += recv_len;
+       *len -= recv_len;
+       if (rcv->ddgst_remaining)
+               return 0;
+
+       if (queue->recv_ddgst != queue->exp_ddgst) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "data digest error: recv %#x expected %#x\n",
+                       le32_to_cpu(queue->recv_ddgst),
+                       le32_to_cpu(queue->exp_ddgst));
+               return -EIO;
+       }
+
+       nvme_tcp_init_recv_ctx(queue);
+       return 0;
+}
+
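+/*
+ * skb receive callback invoked from tcp_read_sock(); runs the
+ * PDU/DATA/DDGST state machine until this skb fragment is fully consumed
+ * or an error disables further reads on the queue.
+ */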
+static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
+                            unsigned int offset, size_t len)
+{
+       struct nvme_tcp_queue *queue = desc->arg.data;
+       size_t consumed = len;
+       int result;
+
+       while (len) {
+               switch (nvme_tcp_recv_state(queue)) {
+               case NVME_TCP_RECV_PDU:
+                       result = nvme_tcp_recv_pdu(queue, skb, &offset, &len);
+                       break;
+               case NVME_TCP_RECV_DATA:
+                       result = nvme_tcp_recv_data(queue, skb, &offset, &len);
+                       break;
+               case NVME_TCP_RECV_DDGST:
+                       result = nvme_tcp_recv_ddgst(queue, skb, &offset, &len);
+                       break;
+               default:
+                       result = -EFAULT;
+               }
+               if (result) {
+                       dev_err(queue->ctrl->ctrl.device,
+                               "receive failed:  %d\n", result);
+                       queue->rd_enabled = false;
+                       nvme_tcp_error_recovery(&queue->ctrl->ctrl);
+                       return result;
+               }
+       }
+
+       return consumed;
+}
+
+static void nvme_tcp_data_ready(struct sock *sk)
+{
+       struct nvme_tcp_queue *queue;
+
+       read_lock(&sk->sk_callback_lock);
+       queue = sk->sk_user_data;
+       if (unlikely(!queue || !queue->rd_enabled))
+               goto done;
+
+       queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+done:
+       read_unlock(&sk->sk_callback_lock);
+}
+
+static void nvme_tcp_write_space(struct sock *sk)
+{
+       struct nvme_tcp_queue *queue;
+
+       read_lock_bh(&sk->sk_callback_lock);
+       queue = sk->sk_user_data;
+
+       if (!queue)
+               goto done;
+
+       if (sk_stream_is_writeable(sk)) {
+               clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+               queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+       }
+done:
+       read_unlock_bh(&sk->sk_callback_lock);
+}
+
+static void nvme_tcp_state_change(struct sock *sk)
+{
+       struct nvme_tcp_queue *queue;
+
+       read_lock(&sk->sk_callback_lock);
+       queue = sk->sk_user_data;
+       if (!queue)
+               goto done;
+
+       switch (sk->sk_state) {
+       case TCP_CLOSE:
+       case TCP_CLOSE_WAIT:
+       case TCP_LAST_ACK:
+       case TCP_FIN_WAIT1:
+       case TCP_FIN_WAIT2:
+               nvme_tcp_error_recovery(&queue->ctrl->ctrl);
+               break;
+       default:
+               dev_info(queue->ctrl->ctrl.device,
+                       "queue %d socket state %d\n",
+                       nvme_tcp_queue_id(queue), sk->sk_state);
+       }
+
+       queue->sc(sk);
+done:
+       read_unlock(&sk->sk_callback_lock);
+}
+
+static inline void nvme_tcp_done_send_req(struct nvme_tcp_queue *queue)
+{
+       queue->request = NULL;
+}
+
+static void nvme_tcp_fail_request(struct nvme_tcp_request *req)
+{
+       union nvme_result res = {};
+
+       nvme_end_request(blk_mq_rq_from_pdu(req),
+               NVME_SC_DATA_XFER_ERROR, res);
+}
+
+static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
+{
+       struct nvme_tcp_queue *queue = req->queue;
+
+       while (true) {
+               struct page *page = nvme_tcp_req_cur_page(req);
+               size_t offset = nvme_tcp_req_cur_offset(req);
+               size_t len = nvme_tcp_req_cur_length(req);
+               bool last = nvme_tcp_pdu_last_send(req, len);
+               int ret, flags = MSG_DONTWAIT;
+
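+               /* MSG_EOR only when no payload or data digest follows */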
+               if (last && !queue->data_digest)
+                       flags |= MSG_EOR;
+               else
+                       flags |= MSG_MORE;
+
+               ret = kernel_sendpage(queue->sock, page, offset, len, flags);
+               if (ret <= 0)
+                       return ret;
+
+               nvme_tcp_advance_req(req, ret);
+               if (queue->data_digest)
+                       nvme_tcp_ddgst_update(queue->snd_hash, page, offset, ret);
+
+               /* fully successful last write */
+               if (last && ret == len) {
+                       if (queue->data_digest) {
+                               nvme_tcp_ddgst_final(queue->snd_hash,
+                                       &req->ddgst);
+                               req->snd.state = NVME_TCP_SEND_DDGST;
+                               req->snd.offset = 0;
+                       } else {
+                               nvme_tcp_done_send_req(queue);
+                       }
+                       return 1;
+               }
+       }
+       return -EAGAIN;
+}
+
+static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
+{
+       struct nvme_tcp_queue *queue = req->queue;
+       struct nvme_tcp_send_ctx *snd = &req->snd;
+       struct nvme_tcp_cmd_pdu *pdu = req->pdu;
+       bool inline_data = nvme_tcp_has_inline_data(req);
+       int flags = MSG_DONTWAIT | (inline_data ? MSG_MORE : MSG_EOR);
+       u8 hdgst = nvme_tcp_hdgst_len(queue);
+       int len = sizeof(*pdu) + hdgst - snd->offset;
+       int ret;
+
+       if (queue->hdr_digest && !snd->offset)
+               nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
+
+       ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
+                       offset_in_page(pdu) + snd->offset, len, flags);
+       if (unlikely(ret <= 0))
+               return ret;
+
+       len -= ret;
+       if (!len) {
+               if (inline_data) {
+                       req->snd.state = NVME_TCP_SEND_DATA;
+                       if (queue->data_digest)
+                               crypto_ahash_init(queue->snd_hash);
+                       nvme_tcp_init_send_iter(req);
+               } else {
+                       nvme_tcp_done_send_req(queue);
+               }
+               return 1;
+       }
+       snd->offset += ret;
+
+       return -EAGAIN;
+}
+
+static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req)
+{
+       struct nvme_tcp_queue *queue = req->queue;
+       struct nvme_tcp_send_ctx *snd = &req->snd;
+       struct nvme_tcp_data_pdu *pdu = req->pdu;
+       u8 hdgst = nvme_tcp_hdgst_len(queue);
+       int len = sizeof(*pdu) - snd->offset + hdgst;
+       int ret;
+
+       if (queue->hdr_digest && !snd->offset)
+               nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
+
+       ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
+                       offset_in_page(pdu) + snd->offset, len,
+                       MSG_DONTWAIT | MSG_MORE);
+       if (unlikely(ret <= 0))
+               return ret;
+
+       len -= ret;
+       if (!len) {
+               req->snd.state = NVME_TCP_SEND_DATA;
+               if (queue->data_digest)
+                       crypto_ahash_init(queue->snd_hash);
+               if (!req->snd.data_sent)
+                       nvme_tcp_init_send_iter(req);
+               return 1;
+       }
+       snd->offset += ret;
+
+       return -EAGAIN;
+}
+
+static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req)
+{
+       struct nvme_tcp_queue *queue = req->queue;
+       int ret;
+       struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_EOR };
+       struct kvec iov = {
+               .iov_base = &req->ddgst + req->snd.offset,
+               .iov_len = NVME_TCP_DIGEST_LENGTH - req->snd.offset
+       };
+
+       ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
+       if (unlikely(ret <= 0))
+               return ret;
+
+       if (req->snd.offset + ret == NVME_TCP_DIGEST_LENGTH) {
+               nvme_tcp_done_send_req(queue);
+               return 1;
+       }
+
+       req->snd.offset += ret;
+       return -EAGAIN;
+}
+
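+/*
+ * Advance the current request through its send states, one sendpage/sendmsg
+ * at a time: CMD_PDU -> (H2C_PDU) -> DATA -> (DDGST).
+ */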
+static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
+{
+       struct nvme_tcp_request *req;
+       int ret = 1;
+
+       if (!queue->request) {
+               queue->request = nvme_tcp_fetch_request(queue);
+               if (!queue->request)
+                       return 0;
+       }
+       req = queue->request;
+
+       if (req->snd.state == NVME_TCP_SEND_CMD_PDU) {
+               ret = nvme_tcp_try_send_cmd_pdu(req);
+               if (ret <= 0)
+                       goto done;
+               if (!nvme_tcp_has_inline_data(req))
+                       return ret;
+       }
+
+       if (req->snd.state == NVME_TCP_SEND_H2C_PDU) {
+               ret = nvme_tcp_try_send_data_pdu(req);
+               if (ret <= 0)
+                       goto done;
+       }
+
+       if (req->snd.state == NVME_TCP_SEND_DATA) {
+               ret = nvme_tcp_try_send_data(req);
+               if (ret <= 0)
+                       goto done;
+       }
+
+       if (req->snd.state == NVME_TCP_SEND_DDGST)
+               ret = nvme_tcp_try_send_ddgst(req);
+done:
+       if (ret == -EAGAIN)
+               ret = 0;
+       return ret;
+}
+
+static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
+{
+       struct sock *sk = queue->sock->sk;
+       read_descriptor_t rd_desc;
+       int consumed;
+
+       rd_desc.arg.data = queue;
+       rd_desc.count = 1;
+       lock_sock(sk);
+       consumed = tcp_read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
+       release_sock(sk);
+       return consumed;
+}
+
+static void nvme_tcp_io_work(struct work_struct *w)
+{
+       struct nvme_tcp_queue *queue =
+               container_of(w, struct nvme_tcp_queue, io_work);
+       unsigned long deadline = jiffies + msecs_to_jiffies(1);
+
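+       /* do send/recv work; stop when idle or when the ~1ms quota is spent */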
+       do {
+               bool pending = false;
+               int result;
+
+               result = nvme_tcp_try_send(queue);
+               if (result > 0) {
+                       pending = true;
+               } else if (unlikely(result < 0)) {
+                       dev_err(queue->ctrl->ctrl.device,
+                               "failed to send request %d\n", result);
+                       if (result != -EPIPE)
+                               nvme_tcp_fail_request(queue->request);
+                       nvme_tcp_done_send_req(queue);
+                       return;
+               }
+
+               result = nvme_tcp_try_recv(queue);
+               if (result > 0)
+                       pending = true;
+
+               if (!pending)
+                       return;
+
+       } while (!time_after(jiffies, deadline)); /* quota is exhausted */
+
+       queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+}
+
+static void nvme_tcp_free_crypto(struct nvme_tcp_queue *queue)
+{
+       struct crypto_ahash *tfm = crypto_ahash_reqtfm(queue->rcv_hash);
+
+       ahash_request_free(queue->rcv_hash);
+       ahash_request_free(queue->snd_hash);
+       crypto_free_ahash(tfm);
+}
+
+static int nvme_tcp_alloc_crypto(struct nvme_tcp_queue *queue)
+{
+       struct crypto_ahash *tfm;
+
+       tfm = crypto_alloc_ahash("crc32c", 0, CRYPTO_ALG_ASYNC);
+       if (IS_ERR(tfm))
+               return PTR_ERR(tfm);
+
+       queue->snd_hash = ahash_request_alloc(tfm, GFP_KERNEL);
+       if (!queue->snd_hash)
+               goto free_tfm;
+       ahash_request_set_callback(queue->snd_hash, 0, NULL, NULL);
+
+       queue->rcv_hash = ahash_request_alloc(tfm, GFP_KERNEL);
+       if (!queue->rcv_hash)
+               goto free_snd_hash;
+       ahash_request_set_callback(queue->rcv_hash, 0, NULL, NULL);
+
+       return 0;
+free_snd_hash:
+       ahash_request_free(queue->snd_hash);
+free_tfm:
+       crypto_free_ahash(tfm);
+       return -ENOMEM;
+}
+
+static void nvme_tcp_free_async_req(struct nvme_tcp_ctrl *ctrl)
+{
+       struct nvme_tcp_request *async = &ctrl->async_req;
+
+       page_frag_free(async->pdu);
+}
+
+static int nvme_tcp_alloc_async_req(struct nvme_tcp_ctrl *ctrl)
+{
+       struct nvme_tcp_queue *queue = &ctrl->queues[0];
+       struct nvme_tcp_request *async = &ctrl->async_req;
+       u8 hdgst = nvme_tcp_hdgst_len(queue);
+
+       async->pdu = page_frag_alloc(&queue->pf_cache,
+               sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
+               GFP_KERNEL | __GFP_ZERO);
+       if (!async->pdu)
+               return -ENOMEM;
+
+       async->queue = &ctrl->queues[0];
+       return 0;
+}
+
+static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
+{
+       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+       struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+
+       if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
+               return;
+
+       if (queue->hdr_digest || queue->data_digest)
+               nvme_tcp_free_crypto(queue);
+
+       sock_release(queue->sock);
+       kfree(queue->rcv.pdu);
+}
+
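+/*
+ * Blocking ICReq/ICResp exchange: negotiate protocol version, header/data
+ * digests and data alignment before the queue is used for fabrics connect.
+ */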
+static int nvme_tcp_init_connection(struct nvme_tcp_queue *queue)
+{
+       struct nvme_tcp_icreq_pdu *icreq;
+       struct nvme_tcp_icresp_pdu *icresp;
+       struct msghdr msg = {};
+       struct kvec iov;
+       bool ctrl_hdgst, ctrl_ddgst;
+       int ret;
+
+       icreq = kzalloc(sizeof(*icreq), GFP_KERNEL);
+       if (!icreq)
+               return -ENOMEM;
+
+       icresp = kzalloc(sizeof(*icresp), GFP_KERNEL);
+       if (!icresp) {
+               ret = -ENOMEM;
+               goto free_icreq;
+       }
+
+       icreq->hdr.type = nvme_tcp_icreq;
+       icreq->hdr.hlen = sizeof(*icreq);
+       icreq->hdr.pdo = 0;
+       icreq->hdr.plen = cpu_to_le32(icreq->hdr.hlen);
+       icreq->pfv = cpu_to_le16(NVME_TCP_PFV_1_0);
+       icreq->maxr2t = cpu_to_le16(1); /* single inflight r2t supported */
+       icreq->hpda = 0; /* no alignment constraint */
+       if (queue->hdr_digest)
+               icreq->digest |= NVME_TCP_HDR_DIGEST_ENABLE;
+       if (queue->data_digest)
+               icreq->digest |= NVME_TCP_DATA_DIGEST_ENABLE;
+
+       iov.iov_base = icreq;
+       iov.iov_len = sizeof(*icreq);
+       ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
+       if (ret < 0)
+               goto free_icresp;
+
+       memset(&msg, 0, sizeof(msg));
+       iov.iov_base = icresp;
+       iov.iov_len = sizeof(*icresp);
+       ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
+                       iov.iov_len, msg.msg_flags);
+       if (ret < 0)
+               goto free_icresp;
+
+       ret = -EINVAL;
+       if (icresp->hdr.type != nvme_tcp_icresp) {
+               pr_err("queue %d: bad type returned %d\n",
+                       nvme_tcp_queue_id(queue), icresp->hdr.type);
+               goto free_icresp;
+       }
+
+       if (le32_to_cpu(icresp->hdr.plen) != sizeof(*icresp)) {
+               pr_err("queue %d: bad pdu length returned %d\n",
+                       nvme_tcp_queue_id(queue), icresp->hdr.plen);
+               goto free_icresp;
+       }
+
+       if (icresp->pfv != NVME_TCP_PFV_1_0) {
+               pr_err("queue %d: bad pfv returned %d\n",
+                       nvme_tcp_queue_id(queue), icresp->pfv);
+               goto free_icresp;
+       }
+
+       ctrl_ddgst = !!(icresp->digest & NVME_TCP_DATA_DIGEST_ENABLE);
+       if ((queue->data_digest && !ctrl_ddgst) ||
+           (!queue->data_digest && ctrl_ddgst)) {
+               pr_err("queue %d: data digest mismatch host: %s ctrl: %s\n",
+                       nvme_tcp_queue_id(queue),
+                       queue->data_digest ? "enabled" : "disabled",
+                       ctrl_ddgst ? "enabled" : "disabled");
+               goto free_icresp;
+       }
+
+       ctrl_hdgst = !!(icresp->digest & NVME_TCP_HDR_DIGEST_ENABLE);
+       if ((queue->hdr_digest && !ctrl_hdgst) ||
+           (!queue->hdr_digest && ctrl_hdgst)) {
+               pr_err("queue %d: header digest mismatch host: %s ctrl: %s\n",
+                       nvme_tcp_queue_id(queue),
+                       queue->hdr_digest ? "enabled" : "disabled",
+                       ctrl_hdgst ? "enabled" : "disabled");
+               goto free_icresp;
+       }
+
+       if (icresp->cpda != 0) {
+               pr_err("queue %d: unsupported cpda returned %d\n",
+                       nvme_tcp_queue_id(queue), icresp->cpda);
+               goto free_icresp;
+       }
+
+       ret = 0;
+free_icresp:
+       kfree(icresp);
+free_icreq:
+       kfree(icreq);
+       return ret;
+}
+
+static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
+               int qid, size_t queue_size)
+{
+       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+       struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+       struct linger sol = { .l_onoff = 1, .l_linger = 0 };
+       int ret, opt, rcv_pdu_size;
+
+       queue->ctrl = ctrl;
+       INIT_LIST_HEAD(&queue->send_list);
+       spin_lock_init(&queue->lock);
+       INIT_WORK(&queue->io_work, nvme_tcp_io_work);
+       queue->queue_size = queue_size;
+
+       if (qid > 0)
+               queue->cmnd_capsule_len = ctrl->ctrl.ioccsz * 16;
+       else
+               queue->cmnd_capsule_len = sizeof(struct nvme_command) +
+                                               NVME_TCP_ADMIN_CCSZ;
+
+       ret = sock_create(ctrl->addr.ss_family, SOCK_STREAM,
+                       IPPROTO_TCP, &queue->sock);
+       if (ret) {
+               dev_err(ctrl->ctrl.device,
+                       "failed to create socket: %d\n", ret);
+               return ret;
+       }
+
+       /* Single syn retry */
+       opt = 1;
+       ret = kernel_setsockopt(queue->sock, IPPROTO_TCP, TCP_SYNCNT,
+                       (char *)&opt, sizeof(opt));
+       if (ret) {
+               dev_err(ctrl->ctrl.device,
+                       "failed to set TCP_SYNCNT sock opt %d\n", ret);
+               goto err_sock;
+       }
+
+       /* Set TCP no delay */
+       opt = 1;
+       ret = kernel_setsockopt(queue->sock, IPPROTO_TCP,
+                       TCP_NODELAY, (char *)&opt, sizeof(opt));
+       if (ret) {
+               dev_err(ctrl->ctrl.device,
+                       "failed to set TCP_NODELAY sock opt %d\n", ret);
+               goto err_sock;
+       }
+
+       /*
+        * Cleanup whatever is sitting in the TCP transmit queue on socket
+        * close. This is done to prevent stale data from being sent should
+        * the network connection be restored before TCP times out.
+        */
+       ret = kernel_setsockopt(queue->sock, SOL_SOCKET, SO_LINGER,
+                       (char *)&sol, sizeof(sol));
+       if (ret) {
+               dev_err(ctrl->ctrl.device,
+                       "failed to set SO_LINGER sock opt %d\n", ret);
+               goto err_sock;
+       }
+
+       queue->sock->sk->sk_allocation = GFP_ATOMIC;
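+       /* admin queue io_work runs on CPU 0, I/O queue N on CPU N - 1 */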
+       queue->io_cpu = (qid == 0) ? 0 : qid - 1;
+       queue->request = NULL;
+       queue->rcv.data_remaining = 0;
+       queue->rcv.ddgst_remaining = 0;
+       queue->rcv.pdu_remaining = 0;
+       queue->rcv.pdu_offset = 0;
+       sk_set_memalloc(queue->sock->sk);
+
+       if (ctrl->ctrl.opts->mask & NVMF_OPT_HOST_TRADDR) {
+               ret = kernel_bind(queue->sock, (struct sockaddr *)&ctrl->src_addr,
+                       sizeof(ctrl->src_addr));
+               if (ret) {
+                       dev_err(ctrl->ctrl.device,
+                               "failed to bind queue %d socket %d\n",
+                               qid, ret);
+                       goto err_sock;
+               }
+       }
+
+       queue->hdr_digest = nctrl->opts->hdr_digest;
+       queue->data_digest = nctrl->opts->data_digest;
+       if (queue->hdr_digest || queue->data_digest) {
+               ret = nvme_tcp_alloc_crypto(queue);
+               if (ret) {
+                       dev_err(ctrl->ctrl.device,
+                               "failed to allocate queue %d crypto\n", qid);
+                       goto err_sock;
+               }
+       }
+
+       rcv_pdu_size = sizeof(struct nvme_tcp_rsp_pdu) +
+                       nvme_tcp_hdgst_len(queue);
+       queue->rcv.pdu = kmalloc(rcv_pdu_size, GFP_KERNEL);
+       if (!queue->rcv.pdu) {
+               ret = -ENOMEM;
+               goto err_crypto;
+       }
+
+       dev_dbg(ctrl->ctrl.device, "connecting queue %d\n",
+                       nvme_tcp_queue_id(queue));
+
+       ret = kernel_connect(queue->sock, (struct sockaddr *)&ctrl->addr,
+               sizeof(ctrl->addr), 0);
+       if (ret) {
+               dev_err(ctrl->ctrl.device,
+                       "failed to connect socket: %d\n", ret);
+               goto err_rcv_pdu;
+       }
+
+       ret = nvme_tcp_init_connection(queue);
+       if (ret)
+               goto err_init_connect;
+
+       queue->rd_enabled = true;
+       set_bit(NVME_TCP_Q_ALLOCATED, &queue->flags);
+       nvme_tcp_init_recv_ctx(queue);
+
+       write_lock_bh(&queue->sock->sk->sk_callback_lock);
+       queue->sock->sk->sk_user_data = queue;
+       queue->sc = queue->sock->sk->sk_state_change;
+       queue->dr = queue->sock->sk->sk_data_ready;
+       queue->ws = queue->sock->sk->sk_write_space;
+       queue->sock->sk->sk_data_ready = nvme_tcp_data_ready;
+       queue->sock->sk->sk_state_change = nvme_tcp_state_change;
+       queue->sock->sk->sk_write_space = nvme_tcp_write_space;
+       write_unlock_bh(&queue->sock->sk->sk_callback_lock);
+
+       return 0;
+
+err_init_connect:
+       kernel_sock_shutdown(queue->sock, SHUT_RDWR);
+err_rcv_pdu:
+       kfree(queue->rcv.pdu);
+err_crypto:
+       if (queue->hdr_digest || queue->data_digest)
+               nvme_tcp_free_crypto(queue);
+err_sock:
+       sock_release(queue->sock);
+       queue->sock = NULL;
+       return ret;
+}
+
+static void nvme_tcp_restore_sock_calls(struct nvme_tcp_queue *queue)
+{
+       struct socket *sock = queue->sock;
+
+       write_lock_bh(&sock->sk->sk_callback_lock);
+       sock->sk->sk_user_data  = NULL;
+       sock->sk->sk_data_ready = queue->dr;
+       sock->sk->sk_state_change = queue->sc;
+       sock->sk->sk_write_space  = queue->ws;
+       write_unlock_bh(&sock->sk->sk_callback_lock);
+}
+
+static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
+{
+       kernel_sock_shutdown(queue->sock, SHUT_RDWR);
+       nvme_tcp_restore_sock_calls(queue);
+       cancel_work_sync(&queue->io_work);
+}
+
+static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
+{
+       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+       struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+
+       if (!test_and_clear_bit(NVME_TCP_Q_LIVE, &queue->flags))
+               return;
+
+       __nvme_tcp_stop_queue(queue);
+}
+
+static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
+{
+       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+       int ret;
+
+       if (idx)
+               ret = nvmf_connect_io_queue(nctrl, idx);
+       else
+               ret = nvmf_connect_admin_queue(nctrl);
+
+       if (!ret) {
+               set_bit(NVME_TCP_Q_LIVE, &ctrl->queues[idx].flags);
+       } else {
+               __nvme_tcp_stop_queue(&ctrl->queues[idx]);
+               dev_err(nctrl->device,
+                       "failed to connect queue: %d ret=%d\n", idx, ret);
+       }
+       return ret;
+}
+
+static void nvme_tcp_free_tagset(struct nvme_ctrl *nctrl,
+               struct blk_mq_tag_set *set)
+{
+       blk_mq_free_tag_set(set);
+}
+
+static struct blk_mq_tag_set *nvme_tcp_alloc_tagset(struct nvme_ctrl *nctrl,
+               bool admin)
+{
+       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+       struct blk_mq_tag_set *set;
+       int ret;
+
+       if (admin) {
+               set = &ctrl->admin_tag_set;
+               memset(set, 0, sizeof(*set));
+               set->ops = &nvme_tcp_admin_mq_ops;
+               set->queue_depth = NVME_AQ_MQ_TAG_DEPTH;
+               set->reserved_tags = 2; /* connect + keep-alive */
+               set->numa_node = NUMA_NO_NODE;
+               set->cmd_size = sizeof(struct nvme_tcp_request);
+               set->driver_data = ctrl;
+               set->nr_hw_queues = 1;
+               set->timeout = ADMIN_TIMEOUT;
+       } else {
+               set = &ctrl->tag_set;
+               memset(set, 0, sizeof(*set));
+               set->ops = &nvme_tcp_mq_ops;
+               set->queue_depth = nctrl->sqsize + 1;
+               set->reserved_tags = 1; /* fabric connect */
+               set->numa_node = NUMA_NO_NODE;
+               set->flags = BLK_MQ_F_SHOULD_MERGE;
+               set->cmd_size = sizeof(struct nvme_tcp_request);
+               set->driver_data = ctrl;
+               set->nr_hw_queues = nctrl->queue_count - 1;
+               set->timeout = NVME_IO_TIMEOUT;
+       }
+
+       ret = blk_mq_alloc_tag_set(set);
+       if (ret)
+               return ERR_PTR(ret);
+
+       return set;
+}
+
+static void nvme_tcp_free_admin_queue(struct nvme_ctrl *ctrl)
+{
+       if (to_tcp_ctrl(ctrl)->async_req.pdu) {
+               nvme_tcp_free_async_req(to_tcp_ctrl(ctrl));
+               to_tcp_ctrl(ctrl)->async_req.pdu = NULL;
+       }
+
+       nvme_tcp_free_queue(ctrl, 0);
+}
+
+static void nvme_tcp_free_io_queues(struct nvme_ctrl *ctrl)
+{
+       int i;
+
+       for (i = 1; i < ctrl->queue_count; i++)
+               nvme_tcp_free_queue(ctrl, i);
+}
+
+static void nvme_tcp_stop_admin_queue(struct nvme_ctrl *ctrl)
+{
+       nvme_tcp_stop_queue(ctrl, 0);
+}
+
+static void nvme_tcp_stop_io_queues(struct nvme_ctrl *ctrl)
+{
+       int i;
+
+       for (i = 1; i < ctrl->queue_count; i++)
+               nvme_tcp_stop_queue(ctrl, i);
+}
+
+static int nvme_tcp_start_admin_queue(struct nvme_ctrl *ctrl)
+{
+       return nvme_tcp_start_queue(ctrl, 0);
+}
+
+static int nvme_tcp_start_io_queues(struct nvme_ctrl *ctrl)
+{
+       int i, ret = 0;
+
+       for (i = 1; i < ctrl->queue_count; i++) {
+               ret = nvme_tcp_start_queue(ctrl, i);
+               if (ret)
+                       goto out_stop_queues;
+       }
+
+       return 0;
+
+out_stop_queues:
+       for (i--; i >= 1; i--)
+               nvme_tcp_stop_queue(ctrl, i);
+       return ret;
+}
+
+static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
+{
+       int ret;
+
+       ret = nvme_tcp_alloc_queue(ctrl, 0, NVME_AQ_DEPTH);
+       if (ret)
+               return ret;
+
+       ret = nvme_tcp_alloc_async_req(to_tcp_ctrl(ctrl));
+       if (ret)
+               goto out_free_queue;
+
+       return 0;
+
+out_free_queue:
+       nvme_tcp_free_queue(ctrl, 0);
+       return ret;
+}
+
+static int nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
+{
+       int i, ret;
+
+       for (i = 1; i < ctrl->queue_count; i++) {
+               ret = nvme_tcp_alloc_queue(ctrl, i,
+                               ctrl->sqsize + 1);
+               if (ret)
+                       goto out_free_queues;
+       }
+
+       return 0;
+
+out_free_queues:
+       for (i--; i >= 1; i--)
+               nvme_tcp_free_queue(ctrl, i);
+
+       return ret;
+}
+
+static unsigned int nvme_tcp_nr_io_queues(struct nvme_ctrl *ctrl)
+{
+       return min(ctrl->queue_count - 1, num_online_cpus());
+}
+
+static int nvme_alloc_io_queues(struct nvme_ctrl *ctrl)
+{
+       unsigned int nr_io_queues;
+       int ret;
+
+       nr_io_queues = nvme_tcp_nr_io_queues(ctrl);
+       ret = nvme_set_queue_count(ctrl, &nr_io_queues);
+       if (ret)
+               return ret;
+
+       ctrl->queue_count = nr_io_queues + 1;
+       if (ctrl->queue_count < 2)
+               return 0;
+
+       dev_info(ctrl->device,
+               "creating %d I/O queues.\n", nr_io_queues);
+
+       return nvme_tcp_alloc_io_queues(ctrl);
+}
+
+void nvme_tcp_destroy_io_queues(struct nvme_ctrl *ctrl, bool remove)
+{
+       nvme_tcp_stop_io_queues(ctrl);
+       if (remove) {
+               if (ctrl->ops->flags & NVME_F_FABRICS)
+                       blk_cleanup_queue(ctrl->connect_q);
+               nvme_tcp_free_tagset(ctrl, ctrl->tagset);
+       }
+       nvme_tcp_free_io_queues(ctrl);
+}
+
+int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
+{
+       int ret;
+
+       ret = nvme_alloc_io_queues(ctrl);
+       if (ret)
+               return ret;
+
+       if (new) {
+               ctrl->tagset = nvme_tcp_alloc_tagset(ctrl, false);
+               if (IS_ERR(ctrl->tagset)) {
+                       ret = PTR_ERR(ctrl->tagset);
+                       goto out_free_io_queues;
+               }
+
+               if (ctrl->ops->flags & NVME_F_FABRICS) {
+                       ctrl->connect_q = blk_mq_init_queue(ctrl->tagset);
+                       if (IS_ERR(ctrl->connect_q)) {
+                               ret = PTR_ERR(ctrl->connect_q);
+                               goto out_free_tag_set;
+                       }
+               }
+       } else {
+               blk_mq_update_nr_hw_queues(ctrl->tagset,
+                       ctrl->queue_count - 1);
+       }
+
+       ret = nvme_tcp_start_io_queues(ctrl);
+       if (ret)
+               goto out_cleanup_connect_q;
+
+       return 0;
+
+out_cleanup_connect_q:
+       if (new && (ctrl->ops->flags & NVME_F_FABRICS))
+               blk_cleanup_queue(ctrl->connect_q);
+out_free_tag_set:
+       if (new)
+               nvme_tcp_free_tagset(ctrl, ctrl->tagset);
+out_free_io_queues:
+       nvme_tcp_free_io_queues(ctrl);
+       return ret;
+}
+
+void nvme_tcp_destroy_admin_queue(struct nvme_ctrl *ctrl, bool remove)
+{
+       nvme_tcp_stop_admin_queue(ctrl);
+       if (remove) {
+               free_opal_dev(ctrl->opal_dev);
+               blk_cleanup_queue(ctrl->admin_q);
+               nvme_tcp_free_tagset(ctrl, ctrl->admin_tagset);
+       }
+       nvme_tcp_free_admin_queue(ctrl);
+}
+
+int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new)
+{
+       int error;
+
+       error = nvme_tcp_alloc_admin_queue(ctrl);
+       if (error)
+               return error;
+
+       if (new) {
+               ctrl->admin_tagset = nvme_tcp_alloc_tagset(ctrl, true);
+               if (IS_ERR(ctrl->admin_tagset)) {
+                       error = PTR_ERR(ctrl->admin_tagset);
+                       goto out_free_queue;
+               }
+
+               ctrl->admin_q = blk_mq_init_queue(ctrl->admin_tagset);
+               if (IS_ERR(ctrl->admin_q)) {
+                       error = PTR_ERR(ctrl->admin_q);
+                       goto out_free_tagset;
+               }
+       }
+
+       error = nvme_tcp_start_admin_queue(ctrl);
+       if (error)
+               goto out_cleanup_queue;
+
+       error = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &ctrl->cap);
+       if (error) {
+               dev_err(ctrl->device,
+                       "prop_get NVME_REG_CAP failed\n");
+               goto out_stop_queue;
+       }
+
+       ctrl->sqsize = min_t(int, NVME_CAP_MQES(ctrl->cap), ctrl->sqsize);
+
+       error = nvme_enable_ctrl(ctrl, ctrl->cap);
+       if (error)
+               goto out_stop_queue;
+
+       error = nvme_init_identify(ctrl);
+       if (error)
+               goto out_stop_queue;
+
+       return 0;
+
+out_stop_queue:
+       nvme_tcp_stop_admin_queue(ctrl);
+out_cleanup_queue:
+       if (new)
+               blk_cleanup_queue(ctrl->admin_q);
+out_free_tagset:
+       if (new)
+               nvme_tcp_free_tagset(ctrl, ctrl->admin_tagset);
+out_free_queue:
+       nvme_tcp_free_admin_queue(ctrl);
+       return error;
+}
+
+static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
+               bool remove)
+{
+       blk_mq_quiesce_queue(ctrl->admin_q);
+       nvme_tcp_stop_admin_queue(ctrl);
+       blk_mq_tagset_busy_iter(ctrl->admin_tagset, nvme_cancel_request, ctrl);
+       blk_mq_unquiesce_queue(ctrl->admin_q);
+       nvme_tcp_destroy_admin_queue(ctrl, remove);
+}
+
+static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
+               bool remove)
+{
+       if (ctrl->queue_count > 1) {
+               nvme_stop_queues(ctrl);
+               nvme_tcp_stop_io_queues(ctrl);
+               blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl);
+               if (remove)
+                       nvme_start_queues(ctrl);
+               nvme_tcp_destroy_io_queues(ctrl, remove);
+       }
+}
+
+static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl)
+{
+       /* If we are resetting/deleting then do nothing */
+       if (ctrl->state != NVME_CTRL_CONNECTING) {
+               WARN_ON_ONCE(ctrl->state == NVME_CTRL_NEW ||
+                       ctrl->state == NVME_CTRL_LIVE);
+               return;
+       }
+
+       if (nvmf_should_reconnect(ctrl)) {
+               dev_info(ctrl->device, "Reconnecting in %d seconds...\n",
+                       ctrl->opts->reconnect_delay);
+               queue_delayed_work(nvme_wq, &ctrl->connect_work,
+                               ctrl->opts->reconnect_delay * HZ);
+       } else {
+               dev_info(ctrl->device, "Removing controller...\n");
+               nvme_delete_ctrl(ctrl);
+       }
+}
+
+static int nvme_tcp_setup_ctrl(struct nvme_ctrl *ctrl, bool new)
+{
+       struct nvmf_ctrl_options *opts = ctrl->opts;
+       int ret;
+
+       ret = nvme_tcp_configure_admin_queue(ctrl, new);
+       if (ret)
+               return ret;
+
+       if (ctrl->icdoff) {
+               dev_err(ctrl->device, "icdoff is not supported!\n");
+               goto destroy_admin;
+       }
+
+       if (opts->queue_size > ctrl->sqsize + 1)
+               dev_warn(ctrl->device,
+                       "queue_size %zu > ctrl sqsize %u, clamping down\n",
+                       opts->queue_size, ctrl->sqsize + 1);
+
+       if (ctrl->sqsize + 1 > ctrl->maxcmd) {
+               dev_warn(ctrl->device,
+                       "sqsize %u > ctrl maxcmd %u, clamping down\n",
+                       ctrl->sqsize + 1, ctrl->maxcmd);
+               ctrl->sqsize = ctrl->maxcmd - 1;
+       }
+
+       if (ctrl->queue_count > 1) {
+               ret = nvme_tcp_configure_io_queues(ctrl, new);
+               if (ret)
+                       goto destroy_admin;
+       }
+
+       if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE)) {
+               /* state change failure is ok if we're in DELETING state */
+               WARN_ON_ONCE(ctrl->state != NVME_CTRL_DELETING);
+               ret = -EINVAL;
+               goto destroy_io;
+       }
+
+       nvme_start_ctrl(ctrl);
+       return 0;
+
+destroy_io:
+       if (ctrl->queue_count > 1)
+               nvme_tcp_destroy_io_queues(ctrl, new);
+destroy_admin:
+       nvme_tcp_stop_admin_queue(ctrl);
+       nvme_tcp_destroy_admin_queue(ctrl, new);
+       return ret;
+}
+
+static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
+{
+       struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
+                       struct nvme_ctrl, connect_work);
+
+       ++ctrl->nr_reconnects;
+
+       if (nvme_tcp_setup_ctrl(ctrl, false))
+               goto requeue;
+
+       dev_info(ctrl->device, "Successfully reconnected (%d attepmpt)\n",
+                       ctrl->nr_reconnects);
+
+       ctrl->nr_reconnects = 0;
+
+       return;
+
+requeue:
+       dev_info(ctrl->device, "Failed reconnect attempt %d\n",
+                       ctrl->nr_reconnects);
+       nvme_tcp_reconnect_or_remove(ctrl);
+}
+
+static void nvme_tcp_error_recovery_work(struct work_struct *work)
+{
+       struct nvme_ctrl *ctrl = container_of(work,
+                       struct nvme_ctrl, err_work);
+
+       nvme_stop_keep_alive(ctrl);
+       nvme_tcp_teardown_io_queues(ctrl, false);
+       /* unquiesce to fail fast pending requests */
+       nvme_start_queues(ctrl);
+       nvme_tcp_teardown_admin_queue(ctrl, false);
+
+       if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_CONNECTING)) {
+               /* state change failure is ok if we're in DELETING state */
+               WARN_ON_ONCE(ctrl->state != NVME_CTRL_DELETING);
+               return;
+       }
+
+       nvme_tcp_reconnect_or_remove(ctrl);
+}
+
+static void nvme_tcp_teardown_ctrl(struct nvme_ctrl *ctrl, bool shutdown)
+{
+       nvme_tcp_teardown_io_queues(ctrl, shutdown);
+       if (shutdown)
+               nvme_shutdown_ctrl(ctrl);
+       else
+               nvme_disable_ctrl(ctrl, ctrl->cap);
+       nvme_tcp_teardown_admin_queue(ctrl, shutdown);
+}
+
+static void nvme_tcp_delete_ctrl(struct nvme_ctrl *ctrl)
+{
+       nvme_tcp_teardown_ctrl(ctrl, true);
+}
+
+static void nvme_reset_ctrl_work(struct work_struct *work)
+{
+       struct nvme_ctrl *ctrl =
+               container_of(work, struct nvme_ctrl, reset_work);
+
+       nvme_stop_ctrl(ctrl);
+       nvme_tcp_teardown_ctrl(ctrl, false);
+
+       if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_CONNECTING)) {
+               /* state change failure is ok if we're in DELETING state */
+               WARN_ON_ONCE(ctrl->state != NVME_CTRL_DELETING);
+               return;
+       }
+
+       if (nvme_tcp_setup_ctrl(ctrl, false))
+               goto out_fail;
+
+       return;
+
+out_fail:
+       ++ctrl->nr_reconnects;
+       nvme_tcp_reconnect_or_remove(ctrl);
+}
+
+static void nvme_tcp_stop_ctrl(struct nvme_ctrl *ctrl)
+{
+       cancel_work_sync(&ctrl->err_work);
+       cancel_delayed_work_sync(&ctrl->connect_work);
+}
+
+static void nvme_tcp_free_ctrl(struct nvme_ctrl *nctrl)
+{
+       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+
+       if (list_empty(&ctrl->list))
+               goto free_ctrl;
+
+       mutex_lock(&nvme_tcp_ctrl_mutex);
+       list_del(&ctrl->list);
+       mutex_unlock(&nvme_tcp_ctrl_mutex);
+
+       nvmf_free_options(nctrl->opts);
+free_ctrl:
+       kfree(ctrl->queues);
+       kfree(ctrl);
+}
+
+static void nvme_tcp_set_sg_null(struct nvme_command *c)
+{
+       struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
+
+       sg->addr = 0;
+       sg->length = 0;
+       sg->type = (NVME_TRANSPORT_SGL_DATA_DESC << 4) |
+                       NVME_SGL_FMT_TRANSPORT_A;
+}
+
+static void nvme_tcp_set_sg_inline(struct nvme_tcp_queue *queue,
+               struct nvme_tcp_request *req, struct nvme_command *c)
+{
+       struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
+
+       sg->addr = cpu_to_le64(queue->ctrl->ctrl.icdoff);
+       sg->length = cpu_to_le32(req->data_len);
+       sg->type = (NVME_SGL_FMT_DATA_DESC << 4) | NVME_SGL_FMT_OFFSET;
+}
+
+static void nvme_tcp_set_sg_host_data(struct nvme_tcp_request *req,
+               struct nvme_command *c)
+{
+       struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
+
+       sg->addr = 0;
+       sg->length = cpu_to_le32(req->data_len);
+       sg->type = (NVME_TRANSPORT_SGL_DATA_DESC << 4) |
+                       NVME_SGL_FMT_TRANSPORT_A;
+}
+
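+/*
+ * AEN commands use the pre-allocated async_req and the out-of-band tag
+ * NVME_AQ_BLK_MQ_DEPTH, matching the special case in nvme_tcp_handle_comp().
+ */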
+static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
+{
+       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(arg);
+       struct nvme_tcp_queue *queue = &ctrl->queues[0];
+       struct nvme_tcp_cmd_pdu *pdu = ctrl->async_req.pdu;
+       struct nvme_command *cmd = &pdu->cmd;
+       u8 hdgst = nvme_tcp_hdgst_len(queue);
+
+       memset(pdu, 0, sizeof(*pdu));
+       pdu->hdr.type = nvme_tcp_cmd;
+       if (queue->hdr_digest)
+               pdu->hdr.flags |= NVME_TCP_F_HDGST;
+       pdu->hdr.hlen = sizeof(*pdu);
+       pdu->hdr.plen = cpu_to_le32(pdu->hdr.hlen + hdgst);
+
+       cmd->common.opcode = nvme_admin_async_event;
+       cmd->common.command_id = NVME_AQ_BLK_MQ_DEPTH;
+       cmd->common.flags |= NVME_CMD_SGL_METABUF;
+       nvme_tcp_set_sg_null(cmd);
+
+       ctrl->async_req.snd.state = NVME_TCP_SEND_CMD_PDU;
+       ctrl->async_req.snd.offset = 0;
+       ctrl->async_req.snd.curr_bio = NULL;
+       ctrl->async_req.rcv.curr_bio = NULL;
+       ctrl->async_req.data_len = 0;
+
+       nvme_tcp_queue_request(&ctrl->async_req);
+}
+
+static enum blk_eh_timer_return
+nvme_tcp_timeout(struct request *rq, bool reserved)
+{
+       struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+       struct nvme_tcp_ctrl *ctrl = req->queue->ctrl;
+       struct nvme_tcp_cmd_pdu *pdu = req->pdu;
+
+       dev_dbg(ctrl->ctrl.device,
+               "queue %d: timeout request %#x type %d\n",
+               nvme_tcp_queue_id(req->queue), rq->tag,
+               pdu->hdr.type);
+
+       if (ctrl->ctrl.state != NVME_CTRL_LIVE) {
+               union nvme_result res = {};
+
+               nvme_req(rq)->flags |= NVME_REQ_CANCELLED;
+               nvme_end_request(rq, NVME_SC_ABORT_REQ, res);
+               return BLK_EH_DONE;
+       }
+
+       /* queue error recovery */
+       nvme_tcp_error_recovery(&ctrl->ctrl);
+
+       return BLK_EH_RESET_TIMER;
+}
+
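+/*
+ * Writes that fit within the in-capsule data size are sent inline right
+ * after the command PDU; everything else is described as host-resident
+ * transport data and transferred via C2HData/R2T exchanges.
+ */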
+static blk_status_t nvme_tcp_map_data(struct nvme_tcp_queue *queue,
+                       struct request *rq)
+{
+       struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+       struct nvme_tcp_cmd_pdu *pdu = req->pdu;
+       struct nvme_command *c = &pdu->cmd;
+
+       c->common.flags |= NVME_CMD_SGL_METABUF;
+
+       if (!req->data_len) {
+               nvme_tcp_set_sg_null(c);
+               return 0;
+       }
+
+       if (rq_data_dir(rq) == WRITE &&
+           req->data_len <= nvme_tcp_inline_data_size(queue))
+               nvme_tcp_set_sg_inline(queue, req, c);
+       else
+               nvme_tcp_set_sg_host_data(req, c);
+
+       return 0;
+}
+
+static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
+               struct request *rq)
+{
+       struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+       struct nvme_tcp_cmd_pdu *pdu = req->pdu;
+       struct nvme_tcp_queue *queue = req->queue;
+       u8 hdgst = nvme_tcp_hdgst_len(queue), ddgst = 0;
+       blk_status_t ret;
+
+       ret = nvme_setup_cmd(ns, rq, &pdu->cmd);
+       if (ret)
+               return ret;
+
+       req->snd.state = NVME_TCP_SEND_CMD_PDU;
+       req->snd.offset = 0;
+       req->snd.data_sent = 0;
+       req->pdu_len = 0;
+       req->pdu_sent = 0;
+       req->data_len = blk_rq_payload_bytes(rq);
+
+       if (rq_data_dir(rq) == WRITE) {
+               req->snd.curr_bio = rq->bio;
+               if (req->data_len <= nvme_tcp_inline_data_size(queue))
+                       req->pdu_len = req->data_len;
+       } else {
+               req->rcv.curr_bio = rq->bio;
+               if (req->rcv.curr_bio)
+                       nvme_tcp_init_recv_iter(req);
+       }
+
+       pdu->hdr.type = nvme_tcp_cmd;
+       pdu->hdr.flags = 0;
+       if (queue->hdr_digest)
+               pdu->hdr.flags |= NVME_TCP_F_HDGST;
+       if (queue->data_digest && req->pdu_len) {
+               pdu->hdr.flags |= NVME_TCP_F_DDGST;
+               ddgst = nvme_tcp_ddgst_len(queue);
+       }
+       pdu->hdr.hlen = sizeof(*pdu);
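+       /* pdo is set only when in-capsule data follows the header */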
+       pdu->hdr.pdo = req->pdu_len ? pdu->hdr.hlen + hdgst : 0;
+       pdu->hdr.plen =
+               cpu_to_le32(pdu->hdr.hlen + hdgst + req->pdu_len + ddgst);
+
+       ret = nvme_tcp_map_data(queue, rq);
+       if (unlikely(ret)) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "Failed to map data (%d)\n", ret);
+               return ret;
+       }
+
+       return 0;
+}
+
+static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
+               const struct blk_mq_queue_data *bd)
+{
+       struct nvme_ns *ns = hctx->queue->queuedata;
+       struct nvme_tcp_queue *queue = hctx->driver_data;
+       struct request *rq = bd->rq;
+       struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+       bool queue_ready = test_bit(NVME_TCP_Q_LIVE, &queue->flags);
+       blk_status_t ret;
+
+       if (!nvmf_check_ready(&queue->ctrl->ctrl, rq, queue_ready))
+               return nvmf_fail_nonready_command(&queue->ctrl->ctrl, rq);
+
+       ret = nvme_tcp_setup_cmd_pdu(ns, rq);
+       if (unlikely(ret))
+               return ret;
+
+       blk_mq_start_request(rq);
+
+       nvme_tcp_queue_request(req);
+
+       return BLK_STS_OK;
+}
+
+static struct blk_mq_ops nvme_tcp_mq_ops = {
+       .queue_rq       = nvme_tcp_queue_rq,
+       .complete       = nvme_complete_rq,
+       .init_request   = nvme_tcp_init_request,
+       .exit_request   = nvme_tcp_exit_request,
+       .init_hctx      = nvme_tcp_init_hctx,
+       .timeout        = nvme_tcp_timeout,
+};
+
+static struct blk_mq_ops nvme_tcp_admin_mq_ops = {
+       .queue_rq       = nvme_tcp_queue_rq,
+       .complete       = nvme_complete_rq,
+       .init_request   = nvme_tcp_init_request,
+       .exit_request   = nvme_tcp_exit_request,
+       .init_hctx      = nvme_tcp_init_admin_hctx,
+       .timeout        = nvme_tcp_timeout,
+};
+
+static const struct nvme_ctrl_ops nvme_tcp_ctrl_ops = {
+       .name                   = "tcp",
+       .module                 = THIS_MODULE,
+       .flags                  = NVME_F_FABRICS,
+       .reg_read32             = nvmf_reg_read32,
+       .reg_read64             = nvmf_reg_read64,
+       .reg_write32            = nvmf_reg_write32,
+       .free_ctrl              = nvme_tcp_free_ctrl,
+       .submit_async_event     = nvme_tcp_submit_async_event,
+       .delete_ctrl            = nvme_tcp_delete_ctrl,
+       .get_address            = nvmf_get_address,
+       .stop_ctrl              = nvme_tcp_stop_ctrl,
+};
+
+static bool
+nvme_tcp_existing_controller(struct nvmf_ctrl_options *opts)
+{
+       struct nvme_tcp_ctrl *ctrl;
+       bool found = false;
+
+       mutex_lock(&nvme_tcp_ctrl_mutex);
+       list_for_each_entry(ctrl, &nvme_tcp_ctrl_list, list) {
+               found = nvmf_ip_options_match(&ctrl->ctrl, opts);
+               if (found)
+                       break;
+       }
+       mutex_unlock(&nvme_tcp_ctrl_mutex);
+
+       return found;
+}
+
+static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
+               struct nvmf_ctrl_options *opts)
+{
+       struct nvme_tcp_ctrl *ctrl;
+       int ret;
+
+       ctrl = kzalloc(sizeof(*ctrl), GFP_KERNEL);
+       if (!ctrl)
+               return ERR_PTR(-ENOMEM);
+
+       INIT_LIST_HEAD(&ctrl->list);
+       ctrl->ctrl.opts = opts;
+       ctrl->ctrl.queue_count = opts->nr_io_queues + 1; /* +1 for admin queue */
+       ctrl->ctrl.sqsize = opts->queue_size - 1;
+       ctrl->ctrl.kato = opts->kato;
+
+       INIT_DELAYED_WORK(&ctrl->ctrl.connect_work,
+                       nvme_tcp_reconnect_ctrl_work);
+       INIT_WORK(&ctrl->ctrl.err_work, nvme_tcp_error_recovery_work);
+       INIT_WORK(&ctrl->ctrl.reset_work, nvme_reset_ctrl_work);
+
+       if (!(opts->mask & NVMF_OPT_TRSVCID)) {
+               opts->trsvcid =
+                       kstrdup(__stringify(NVME_TCP_DISC_PORT), GFP_KERNEL);
+               if (!opts->trsvcid) {
+                       ret = -ENOMEM;
+                       goto out_free_ctrl;
+               }
+               opts->mask |= NVMF_OPT_TRSVCID;
+       }
+
+       ret = inet_pton_with_scope(&init_net, AF_UNSPEC,
+                       opts->traddr, opts->trsvcid, &ctrl->addr);
+       if (ret) {
+               pr_err("malformed address passed: %s:%s\n",
+                       opts->traddr, opts->trsvcid);
+               goto out_free_ctrl;
+       }
+
+       if (opts->mask & NVMF_OPT_HOST_TRADDR) {
+               ret = inet_pton_with_scope(&init_net, AF_UNSPEC,
+                       opts->host_traddr, NULL, &ctrl->src_addr);
+               if (ret) {
+                       pr_err("malformed src address passed: %s\n",
+                              opts->host_traddr);
+                       goto out_free_ctrl;
+               }
+       }
+
+       if (!opts->duplicate_connect && nvme_tcp_existing_controller(opts)) {
+               ret = -EALREADY;
+               goto out_free_ctrl;
+       }
+
+       ctrl->queues = kcalloc(opts->nr_io_queues + 1, sizeof(*ctrl->queues),
+                               GFP_KERNEL);
+       if (!ctrl->queues) {
+               ret = -ENOMEM;
+               goto out_free_ctrl;
+       }
+
+       ret = nvme_init_ctrl(&ctrl->ctrl, dev, &nvme_tcp_ctrl_ops, 0);
+       if (ret)
+               goto out_kfree_queues;
+
+       if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
+               WARN_ON_ONCE(1);
+               ret = -EINTR;
+               goto out_uninit_ctrl;
+       }
+
+       ret = nvme_tcp_setup_ctrl(&ctrl->ctrl, true);
+       if (ret)
+               goto out_uninit_ctrl;
+
+       dev_info(ctrl->ctrl.device, "new ctrl: NQN \"%s\", addr %pISp\n",
+               ctrl->ctrl.opts->subsysnqn, &ctrl->addr);
+
+       nvme_get_ctrl(&ctrl->ctrl);
+
+       mutex_lock(&nvme_tcp_ctrl_mutex);
+       list_add_tail(&ctrl->list, &nvme_tcp_ctrl_list);
+       mutex_unlock(&nvme_tcp_ctrl_mutex);
+
+       return &ctrl->ctrl;
+
+out_uninit_ctrl:
+       nvme_uninit_ctrl(&ctrl->ctrl);
+       nvme_put_ctrl(&ctrl->ctrl);
+       if (ret > 0)
+               ret = -EIO;
+       return ERR_PTR(ret);
+out_kfree_queues:
+       kfree(ctrl->queues);
+out_free_ctrl:
+       kfree(ctrl);
+       return ERR_PTR(ret);
+}
+
+static struct nvmf_transport_ops nvme_tcp_transport = {
+       .name           = "tcp",
+       .module         = THIS_MODULE,
+       .required_opts  = NVMF_OPT_TRADDR,
+       .allowed_opts   = NVMF_OPT_TRSVCID | NVMF_OPT_RECONNECT_DELAY |
+                         NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO |
+                         NVMF_OPT_HDR_DIGEST | NVMF_OPT_DATA_DIGEST,
+       .create_ctrl    = nvme_tcp_create_ctrl,
+};
+
+static int __init nvme_tcp_init_module(void)
+{
+       nvme_tcp_wq = alloc_workqueue("nvme_tcp_wq",
+                       WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
+       if (!nvme_tcp_wq)
+               return -ENOMEM;
+
+       nvmf_register_transport(&nvme_tcp_transport);
+       return 0;
+}
+
+static void __exit nvme_tcp_cleanup_module(void)
+{
+       struct nvme_tcp_ctrl *ctrl;
+
+       nvmf_unregister_transport(&nvme_tcp_transport);
+
+       mutex_lock(&nvme_tcp_ctrl_mutex);
+       list_for_each_entry(ctrl, &nvme_tcp_ctrl_list, list)
+               nvme_delete_ctrl(&ctrl->ctrl);
+       mutex_unlock(&nvme_tcp_ctrl_mutex);
+       flush_workqueue(nvme_delete_wq);
+
+       destroy_workqueue(nvme_tcp_wq);
+}
+
+module_init(nvme_tcp_init_module);
+module_exit(nvme_tcp_cleanup_module);
+
+MODULE_LICENSE("GPL v2");
--
2.17.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme at lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 14/14] nvme-tcp: add NVMe over TCP host driver
  2018-11-20 23:34     ` Narayan Ayalasomayajula
@ 2018-11-21  0:10       ` Sagi Grimberg
  -1 siblings, 0 replies; 76+ messages in thread
From: Sagi Grimberg @ 2018-11-21  0:10 UTC (permalink / raw)
  To: Narayan Ayalasomayajula, linux-nvme
  Cc: linux-block, netdev, Keith Busch, David S. Miller, Christoph Hellwig


> Hi Sagi,
> 
>> +       icreq->pfv = cpu_to_le16(NVME_TCP_PFV_1_0);
>> +       icreq->maxr2t = cpu_to_le16(1); /* single inflight r2t supported */
>> +       icreq->hpda = 0; /* no alignment constraint */
> 
> The NVMe-TCP spec indicates that MAXR2T is a 0's-based value. To support a single inflight R2T as indicated in the comment above, icreq->maxr2t should be set to 0, right?

Correct, will fix in the next spin.
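
For reference, a minimal sketch of the corrected initialization (same fields
as quoted above, only the maxr2t value changes):

	icreq->pfv = cpu_to_le16(NVME_TCP_PFV_1_0);
	icreq->maxr2t = cpu_to_le16(0); /* single inflight r2t supported (0's based) */
	icreq->hpda = 0; /* no alignment constraint */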

Thanks

^ permalink raw reply	[flat|nested] 76+ messages in thread

* RE: [PATCH v2 14/14] nvme-tcp: add NVMe over TCP host driver
  2018-11-20  3:00   ` Sagi Grimberg
@ 2018-11-21  0:41     ` Ethan Weidman
  -1 siblings, 0 replies; 76+ messages in thread
From: Ethan Weidman @ 2018-11-21  0:41 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme
  Cc: linux-block, netdev, David S. Miller, Keith Busch, Christoph Hellwig

>+static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
>+               struct nvmf_ctrl_options *opts)
>+{
>+       struct nvme_tcp_ctrl *ctrl;
>+       int ret;
>+
>+       ctrl = kzalloc(sizeof(*ctrl), GFP_KERNEL);
>+       if (!ctrl)
>+               return ERR_PTR(-ENOMEM);
>+
>+       INIT_LIST_HEAD(&ctrl->list);
>+       ctrl->ctrl.opts = opts;
>+       ctrl->ctrl.queue_count = opts->nr_io_queues + 1; /* +1 for admin queue */
>+       ctrl->ctrl.sqsize = opts->queue_size - 1;
>+       ctrl->ctrl.kato = opts->kato;
>+
>+       INIT_DELAYED_WORK(&ctrl->ctrl.connect_work,
>+                       nvme_tcp_reconnect_ctrl_work);
>+       INIT_WORK(&ctrl->ctrl.err_work, nvme_tcp_error_recovery_work);
>+       INIT_WORK(&ctrl->ctrl.reset_work, nvme_reset_ctrl_work);
>+
>+       if (!(opts->mask & NVMF_OPT_TRSVCID)) {
>+               opts->trsvcid =
>+                       kstrdup(__stringify(NVME_TCP_DISC_PORT), GFP_KERNEL);
>+               if (!opts->trsvcid) {
>+                       ret = -ENOMEM;
>+                       goto out_free_ctrl;
>+               }
>+               opts->mask |= NVMF_OPT_TRSVCID;
>+       }

NVME_TCP_DISC_PORT, 8009, should not be the default for nvme fabric connections when trsvcid is not specified.

NVME_RDMA_IP_PORT, 4420, should be used, or renamed to something more generic.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 14/14] nvme-tcp: add NVMe over TCP host driver
  2018-11-21  0:41     ` Ethan Weidman
@ 2018-11-21  5:43       ` Sagi Grimberg
  -1 siblings, 0 replies; 76+ messages in thread
From: Sagi Grimberg @ 2018-11-21  5:43 UTC (permalink / raw)
  To: Ethan Weidman, linux-nvme
  Cc: linux-block, netdev, David S. Miller, Keith Busch, Christoph Hellwig


> NVME_TCP_DISC_PORT, 8009, should not be the default for nvme fabric connections when trsvcid is not specified.

It's allowed; the statement simply recommends against it because of possible
misuse in an environment where iWARP and TCP are deployed on the same
port space...

Also, it recommends that subsystems don't listen on 8009 by default; it says
nothing about the host. Furthermore, controllers that are not passed an
explicit port are usually discovery controllers anyway.

> NVME_RDMA_IP_PORT, 4420, should be used, or renamed to something more generic.

Not at all... in fact, this is really not what the spec intends.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 14/14] nvme-tcp: add NVMe over TCP host driver
  2018-11-20  3:00   ` Sagi Grimberg
@ 2018-11-21  8:56     ` Christoph Hellwig
  -1 siblings, 0 replies; 76+ messages in thread
From: Christoph Hellwig @ 2018-11-21  8:56 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-nvme, linux-block, netdev, David S. Miller, Keith Busch,
	Christoph Hellwig

On Mon, Nov 19, 2018 at 07:00:16PM -0800, Sagi Grimberg wrote:
> From: Sagi Grimberg <sagi@lightbitslabs.com>
> 
> This patch implements the NVMe over TCP host driver. It can be used to
> connect to remote NVMe over Fabrics subsystems over good old TCP/IP.
> 
> The driver implements TP 8000, which defines how nvme over fabrics capsules
> and data are encapsulated in nvme-tcp pdus and exchanged on top of a TCP byte
> stream. nvme-tcp header and data digest are supported as well.
> 
> To connect to all NVMe over Fabrics controllers reachable on a given target
> port over TCP, use the following command:
> 
> 	nvme connect-all -t tcp -a $IPADDR
> 
> This requires the latest version of nvme-cli with TCP support.
> 
> Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
> Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
> Signed-off-by: Solganik Alexander <sashas@lightbitslabs.com>
> ---
>  drivers/nvme/host/Kconfig  |   15 +
>  drivers/nvme/host/Makefile |    3 +
>  drivers/nvme/host/tcp.c    | 2306 ++++++++++++++++++++++++++++++++++++
>  3 files changed, 2324 insertions(+)
>  create mode 100644 drivers/nvme/host/tcp.c
> 
> diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
> index 88a8b5916624..0f345e207675 100644
> --- a/drivers/nvme/host/Kconfig
> +++ b/drivers/nvme/host/Kconfig
> @@ -57,3 +57,18 @@ config NVME_FC
>  	  from https://github.com/linux-nvme/nvme-cli.
>  
>  	  If unsure, say N.
> +
> +config NVME_TCP
> +	tristate "NVM Express over Fabrics TCP host driver"
> +	depends on INET
> +	depends on BLK_DEV_NVME
> +	select NVME_FABRICS
> +	help
> +	  This provides support for the NVMe over Fabrics protocol using
> +	  the TCP transport.  This allows you to use remote block devices
> +	  exported using the NVMe protocol set.
> +
> +	  To configure a NVMe over Fabrics controller use the nvme-cli tool
> +	  from https://github.com/linux-nvme/nvme-cli.
> +
> +	  If unsure, say N.
> diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
> index aea459c65ae1..8a4b671c5f0c 100644
> --- a/drivers/nvme/host/Makefile
> +++ b/drivers/nvme/host/Makefile
> @@ -7,6 +7,7 @@ obj-$(CONFIG_BLK_DEV_NVME)		+= nvme.o
>  obj-$(CONFIG_NVME_FABRICS)		+= nvme-fabrics.o
>  obj-$(CONFIG_NVME_RDMA)			+= nvme-rdma.o
>  obj-$(CONFIG_NVME_FC)			+= nvme-fc.o
> +obj-$(CONFIG_NVME_TCP)			+= nvme-tcp.o
>  
>  nvme-core-y				:= core.o
>  nvme-core-$(CONFIG_TRACING)		+= trace.o
> @@ -21,3 +22,5 @@ nvme-fabrics-y				+= fabrics.o
>  nvme-rdma-y				+= rdma.o
>  
>  nvme-fc-y				+= fc.o
> +
> +nvme-tcp-y				+= tcp.o
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> new file mode 100644
> index 000000000000..4c583859f0ad
> --- /dev/null
> +++ b/drivers/nvme/host/tcp.c
> @@ -0,0 +1,2306 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * NVMe over Fabrics TCP host.
> + * Copyright (c) 2018 LightBits Labs. All rights reserved.
> + */
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/slab.h>
> +#include <linux/err.h>
> +#include <linux/nvme-tcp.h>
> +#include <net/sock.h>
> +#include <net/tcp.h>
> +#include <linux/blk-mq.h>
> +#include <crypto/hash.h>
> +
> +#include "nvme.h"
> +#include "fabrics.h"
> +
> +struct nvme_tcp_queue;
> +
> +enum nvme_tcp_send_state {
> +	NVME_TCP_SEND_CMD_PDU = 0,
> +	NVME_TCP_SEND_H2C_PDU,
> +	NVME_TCP_SEND_DATA,
> +	NVME_TCP_SEND_DDGST,
> +};
> +
> +struct nvme_tcp_send_ctx {
> +	struct bio		*curr_bio;
> +	struct iov_iter		iter;
> +	size_t			offset;
> +	size_t			data_sent;
> +	enum nvme_tcp_send_state state;
> +};
> +
> +struct nvme_tcp_recv_ctx {
> +	struct iov_iter		iter;
> +	struct bio		*curr_bio;
> +};

I don't understand these structures.  There should only be
a bio to send or receive, not both.  Why do we need two
curr_bio pointers?

To me it seems like both structures should just go away and
move into nvme_tcp_request ala:


	struct bio		*curr_bio;

	/* send state */
	struct iov_iter		send_iter;
	size_t			send_offset;
	enum nvme_tcp_send_state send_state;
	size_t			data_sent;

	/* receive state */
	struct iov_iter		recv_iter;


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 14/14] nvme-tcp: add NVMe over TCP host driver
  2018-11-20  3:00   ` Sagi Grimberg
@ 2018-11-21 12:01     ` Mikhail Skorzhinskii
  -1 siblings, 0 replies; 76+ messages in thread
From: Mikhail Skorzhinskii @ 2018-11-21 12:01 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-nvme, linux-block, netdev, David S. Miller, Keith Busch,
	Christoph Hellwig

Sagi Grimberg <sagi@grimberg.me> writes:
 > +static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req)
 > +{
 > +	struct nvme_tcp_queue *queue = req->queue;
 > +
 > +	spin_lock_bh(&queue->lock);
 > +	list_add_tail(&req->entry, &queue->send_list);
 > +	spin_unlock_bh(&queue->lock);
 > +
 > +	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
 > +}

Maybe I'm missing something, but why bother with the bottom-half version of
locking?

There are a few places where this lock can be accessed:

 (1) From ->queue_rq() call;
 (2) From submitting new AEN request;
 (3) From receiving new R2T;

Which one of these originates from the bottom half? Not 100% sure about the
queue_rq data path, but (2) and (3) look perfectly safe to me.

Possibly just a relic of some previous iterations of experimenting?

Mikhail Skorzhinskii

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 07/14] nvme-core: add work elements to struct nvme_ctrl
  2018-11-20  3:00   ` Sagi Grimberg
@ 2018-11-21 13:04     ` Christoph Hellwig
  -1 siblings, 0 replies; 76+ messages in thread
From: Christoph Hellwig @ 2018-11-21 13:04 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-nvme, linux-block, netdev, David S. Miller, Keith Busch,
	Christoph Hellwig

On Mon, Nov 19, 2018 at 07:00:09PM -0800, Sagi Grimberg wrote:
> From: Sagi Grimberg <sagi@lightbitslabs.com>
> 
> connect_work and err_work will be reused by nvme-tcp, so move them into
> nvme_ctrl for rdma and fc to share.

As said before when you sent this individually:  I'd rather not move
struct members to the common structure until we actually start using
them there.  So for now please add them to the containing TCP-specific
structure, although I'm looking forward to actually seeing the code around
them consolidated eventually.
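
In other words, a minimal sketch of what the TCP-specific controller could
carry, assuming the field names from the commit message are kept:

	struct nvme_tcp_ctrl {
		/* ... existing members ... */
		struct work_struct	err_work;
		struct delayed_work	connect_work;
	};

with the INIT_WORK()/INIT_DELAYED_WORK() calls in nvme_tcp_create_ctrl()
pointing at &ctrl->err_work and &ctrl->connect_work instead of the nvme_ctrl
members.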

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 08/14] nvmet: Add install_queue callout
  2018-11-20  3:00   ` Sagi Grimberg
@ 2018-11-21 13:05     ` Christoph Hellwig
  -1 siblings, 0 replies; 76+ messages in thread
From: Christoph Hellwig @ 2018-11-21 13:05 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-nvme, linux-block, netdev, David S. Miller, Keith Busch,
	Christoph Hellwig

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 09/14] nvmet: allow configfs tcp trtype configuration
  2018-11-20  3:00   ` Sagi Grimberg
@ 2018-11-21 13:05     ` Christoph Hellwig
  -1 siblings, 0 replies; 76+ messages in thread
From: Christoph Hellwig @ 2018-11-21 13:05 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-nvme, linux-block, netdev, David S. Miller, Keith Busch,
	Christoph Hellwig

On Mon, Nov 19, 2018 at 07:00:11PM -0800, Sagi Grimberg wrote:
> From: Sagi Grimberg <sagi@lightbitslabs.com>
> 
> Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
> Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 10/14] nvme-fabrics: allow user passing header digest
  2018-11-20  3:00   ` Sagi Grimberg
@ 2018-11-21 13:06     ` Christoph Hellwig
  -1 siblings, 0 replies; 76+ messages in thread
From: Christoph Hellwig @ 2018-11-21 13:06 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-nvme, linux-block, netdev, David S. Miller, Keith Busch,
	Christoph Hellwig

On Mon, Nov 19, 2018 at 07:00:12PM -0800, Sagi Grimberg wrote:
> From: Sagi Grimberg <sagi@lightbitslabs.com>
> 
> Header digest is an nvme-tcp specific feature, but nothing prevents other
> transports from reusing the concept, so do not associate it solely with the
> tcp transport.
> 
> Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 11/14] nvme-fabrics: allow user passing data digest
  2018-11-20  3:00   ` Sagi Grimberg
@ 2018-11-21 13:06     ` Christoph Hellwig
  -1 siblings, 0 replies; 76+ messages in thread
From: Christoph Hellwig @ 2018-11-21 13:06 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-nvme, linux-block, netdev, David S. Miller, Keith Busch,
	Christoph Hellwig

On Mon, Nov 19, 2018 at 07:00:13PM -0800, Sagi Grimberg wrote:
> From: Sagi Grimberg <sagi@lightbitslabs.com>
> 
> Data digest is an nvme-tcp specific feature, but nothing prevents other
> transports from reusing the concept, so do not associate it solely with the
> tcp transport.
> 
> Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 13/14] nvmet-tcp: add NVMe over TCP target driver
  2018-11-20  3:00   ` Sagi Grimberg
@ 2018-11-21 13:07     ` Christoph Hellwig
  -1 siblings, 0 replies; 76+ messages in thread
From: Christoph Hellwig @ 2018-11-21 13:07 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-nvme, linux-block, netdev, David S. Miller, Keith Busch,
	Christoph Hellwig

> diff --git a/include/linux/nvme-tcp.h b/include/linux/nvme-tcp.h
> index 33c8afaf63bd..685780d1ed04 100644
> --- a/include/linux/nvme-tcp.h
> +++ b/include/linux/nvme-tcp.h
> @@ -11,6 +11,7 @@
>  
>  #define NVME_TCP_DISC_PORT	8009
>  #define NVME_TCP_ADMIN_CCSZ	SZ_8K
> +#define NVME_TCP_DIGEST_LENGTH	4

This should go into the previous patch.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 14/14] nvme-tcp: add NVMe over TCP host driver
  2018-11-21  8:56     ` Christoph Hellwig
@ 2018-11-21 22:27       ` Sagi Grimberg
  -1 siblings, 0 replies; 76+ messages in thread
From: Sagi Grimberg @ 2018-11-21 22:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvme, linux-block, netdev, David S. Miller, Keith Busch


>> +struct nvme_tcp_send_ctx {
>> +	struct bio		*curr_bio;
>> +	struct iov_iter		iter;
>> +	size_t			offset;
>> +	size_t			data_sent;
>> +	enum nvme_tcp_send_state state;
>> +};
>> +
>> +struct nvme_tcp_recv_ctx {
>> +	struct iov_iter		iter;
>> +	struct bio		*curr_bio;
>> +};
> 
> I don't understand these structures.  There should only be
> a bio to send or receive, not both.  Why do we need two
> curr_bio pointers?

We don't really need both...

> To me it seems like both structures should just go away and
> move into nvme_tcp_request ala:
> 
> 
> 	struct bio		*curr_bio;
> 
> 	/* send state */
> 	struct iov_iter		send_iter;
> 	size_t			send_offset;
> 	enum nvme_tcp_send_state send_state;
> 	size_t			data_sent;
> 
> 	/* receive state */
> 	struct iov_iter		recv_iter;
> 

Sure, will move this.
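
Roughly, the consolidated request would then look like this (members outside
the send/receive state elided):

	struct nvme_tcp_request {
		/* ... existing members (req, queue, pdu, lengths, ...) ... */
		struct bio		*curr_bio;

		/* send state */
		struct iov_iter		send_iter;
		size_t			send_offset;
		enum nvme_tcp_send_state send_state;
		size_t			data_sent;

		/* receive state */
		struct iov_iter		recv_iter;
	};

and the snd.curr_bio / rcv.curr_bio assignments in nvme_tcp_setup_cmd_pdu()
collapse into a single req->curr_bio.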

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 14/14] nvme-tcp: add NVMe over TCP host driver
  2018-11-21 12:01     ` Mikhail Skorzhinskii
@ 2018-11-21 22:28       ` Sagi Grimberg
  -1 siblings, 0 replies; 76+ messages in thread
From: Sagi Grimberg @ 2018-11-21 22:28 UTC (permalink / raw)
  To: Mikhail Skorzhinskii
  Cc: linux-nvme, linux-block, netdev, David S. Miller, Keith Busch,
	Christoph Hellwig



On 11/21/18 4:01 AM, Mikhail Skorzhinskii wrote:
> Sagi Grimberg <sagi@grimberg.me> writes:
>   > +static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req)
>   > +{
>   > +	struct nvme_tcp_queue *queue = req->queue;
>   > +
>   > +	spin_lock_bh(&queue->lock);
>   > +	list_add_tail(&req->entry, &queue->send_list);
>   > +	spin_unlock_bh(&queue->lock);
>   > +
>   > +	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>   > +}
> 
> Maybe I'm missing something, but why bother with the bottom-half version of
> locking?
> 
> There are a few places where this lock can be accessed:
> 
>   (1) From ->queue_rq() call;
>   (2) From submitting new AEN request;
>   (3) From receiving new R2T;
> 
> Which one of these originates from the bottom half? Not 100% sure about the
> queue_rq data path, but (2) and (3) look perfectly safe to me.

Actually, in former versions (3) was invoked in soft-irq context, which
is why this included the bh disable, but I guess it can be removed now.
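
If so, the queueing helper could presumably drop the _bh variants, along the
lines of:

	static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req)
	{
		struct nvme_tcp_queue *queue = req->queue;

		/* all callers would now run in process context, no soft-irq path */
		spin_lock(&queue->lock);
		list_add_tail(&req->entry, &queue->send_list);
		spin_unlock(&queue->lock);

		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
	}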

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 07/14] nvme-core: add work elements to struct nvme_ctrl
  2018-11-21 13:04     ` Christoph Hellwig
@ 2018-11-21 22:28       ` Sagi Grimberg
  -1 siblings, 0 replies; 76+ messages in thread
From: Sagi Grimberg @ 2018-11-21 22:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvme, linux-block, netdev, David S. Miller, Keith Busch


>> connect_work and err_work will be reused by nvme-tcp, so move them into
>> nvme_ctrl for rdma and fc to share.
> 
> As said before when you sent this individually:  I'd rather not move
> struct members to the common structure until we actually start using
> them there.  So for now please add them to the containing TCP-specific
> structure, although I'm looking forward to actually seeing the code around
> them consolidated eventually.

FFFine....

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 13/14] nvmet-tcp: add NVMe over TCP target driver
  2018-11-21 13:07     ` Christoph Hellwig
@ 2018-11-21 22:29       ` Sagi Grimberg
  -1 siblings, 0 replies; 76+ messages in thread
From: Sagi Grimberg @ 2018-11-21 22:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvme, linux-block, netdev, David S. Miller, Keith Busch


>>   #define NVME_TCP_DISC_PORT	8009
>>   #define NVME_TCP_ADMIN_CCSZ	SZ_8K
>> +#define NVME_TCP_DIGEST_LENGTH	4
> 
> This should go into the previous patch.

Correct.

^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2018-11-21 22:29 UTC | newest]

Thread overview: 76+ messages
2018-11-20  3:00 [PATCH v2 00/14] TCP transport binding for NVMe over Fabrics Sagi Grimberg
2018-11-20  3:00 ` Sagi Grimberg
2018-11-20  3:00 ` [PATCH v2 01/14] ath6kl: add ath6kl_ prefix to crypto_type Sagi Grimberg
2018-11-20  3:00   ` Sagi Grimberg
2018-11-20  3:00 ` [PATCH v2 02/14] datagram: open-code copy_page_to_iter Sagi Grimberg
2018-11-20  3:00   ` Sagi Grimberg
2018-11-20  3:00 ` [PATCH v2 03/14] iov_iter: pass void csum pointer to csum_and_copy_to_iter Sagi Grimberg
2018-11-20  3:00   ` Sagi Grimberg
2018-11-20  3:00 ` [PATCH v2 04/14] datagram: consolidate datagram copy to iter helpers Sagi Grimberg
2018-11-20  3:00   ` Sagi Grimberg
2018-11-20  3:00 ` [PATCH v2 04/14] net/datagram: " Sagi Grimberg
2018-11-20  3:00   ` Sagi Grimberg
2018-11-20  3:00 ` [PATCH v2 05/14] iov_iter: introduce hash_and_copy_to_iter helper Sagi Grimberg
2018-11-20  3:00   ` Sagi Grimberg
2018-11-20  3:00 ` [PATCH v2 06/14] datagram: introduce skb_copy_and_hash_datagram_iter helper Sagi Grimberg
2018-11-20  3:00   ` Sagi Grimberg
2018-11-20  3:00 ` [PATCH v2 07/14] nvme-core: add work elements to struct nvme_ctrl Sagi Grimberg
2018-11-20  3:00   ` Sagi Grimberg
2018-11-21 13:04   ` Christoph Hellwig
2018-11-21 13:04     ` Christoph Hellwig
2018-11-21 22:28     ` Sagi Grimberg
2018-11-21 22:28       ` Sagi Grimberg
2018-11-20  3:00 ` [PATCH v2 08/14] nvmet: Add install_queue callout Sagi Grimberg
2018-11-20  3:00   ` Sagi Grimberg
2018-11-21 13:05   ` Christoph Hellwig
2018-11-21 13:05     ` Christoph Hellwig
2018-11-20  3:00 ` [PATCH v2 09/14] nvmet: allow configfs tcp trtype configuration Sagi Grimberg
2018-11-20  3:00   ` Sagi Grimberg
2018-11-21 13:05   ` Christoph Hellwig
2018-11-21 13:05     ` Christoph Hellwig
2018-11-20  3:00 ` [PATCH v2 10/14] nvme-fabrics: allow user passing header digest Sagi Grimberg
2018-11-20  3:00   ` Sagi Grimberg
2018-11-21 13:06   ` Christoph Hellwig
2018-11-21 13:06     ` Christoph Hellwig
2018-11-20  3:00 ` [PATCH v2 11/14] nvme-fabrics: allow user passing data digest Sagi Grimberg
2018-11-20  3:00   ` Sagi Grimberg
2018-11-21 13:06   ` Christoph Hellwig
2018-11-21 13:06     ` Christoph Hellwig
2018-11-20  3:00 ` [PATCH v2 12/14] nvme-tcp: Add protocol header Sagi Grimberg
2018-11-20  3:00   ` Sagi Grimberg
2018-11-20  3:00 ` [PATCH v2 13/14] nvmet-tcp: add NVMe over TCP target driver Sagi Grimberg
2018-11-20  3:00   ` Sagi Grimberg
2018-11-21 13:07   ` Christoph Hellwig
2018-11-21 13:07     ` Christoph Hellwig
2018-11-21 22:29     ` Sagi Grimberg
2018-11-21 22:29       ` Sagi Grimberg
2018-11-20  3:00 ` [PATCH v2 14/14] nvme-tcp: add NVMe over TCP host driver Sagi Grimberg
2018-11-20  3:00   ` Sagi Grimberg
2018-11-20 23:34   ` Narayan Ayalasomayajula
2018-11-20 23:34     ` Narayan Ayalasomayajula
2018-11-20 23:34     ` Narayan Ayalasomayajula
2018-11-21  0:10     ` Sagi Grimberg
2018-11-21  0:10       ` Sagi Grimberg
2018-11-21  0:10       ` Sagi Grimberg
2018-11-21  0:41   ` Ethan Weidman
2018-11-21  0:41     ` Ethan Weidman
2018-11-21  5:43     ` Sagi Grimberg
2018-11-21  5:43       ` Sagi Grimberg
2018-11-21  8:56   ` Christoph Hellwig
2018-11-21  8:56     ` Christoph Hellwig
2018-11-21 22:27     ` Sagi Grimberg
2018-11-21 22:27       ` Sagi Grimberg
2018-11-21 12:01   ` Mikhail Skorzhinskii
2018-11-21 12:01     ` Mikhail Skorzhinskii
2018-11-21 22:28     ` Sagi Grimberg
2018-11-21 22:28       ` Sagi Grimberg
2018-11-20  3:00 ` [PATCH nvme-cli v2 15/14] nvme: Add TCP transport Sagi Grimberg
2018-11-20  3:00   ` Sagi Grimberg
2018-11-20  9:36   ` Arend van Spriel
2018-11-20  9:36     ` Arend van Spriel
2018-11-20 22:56     ` Sagi Grimberg
2018-11-20 22:56       ` Sagi Grimberg
2018-11-20  3:00 ` [PATCH nvme-cli v2 16/14] fabrics: add tcp port tsas decoding Sagi Grimberg
2018-11-20  3:00   ` Sagi Grimberg
2018-11-20  3:00 ` [PATCH nvme-cli v2 17/14] fabrics: add transport header and data digest Sagi Grimberg
2018-11-20  3:00   ` Sagi Grimberg
