* [Patch v2 00/19] CIFS: Implement SMBDirect
@ 2017-08-20 19:04 Long Li
  2017-08-20 19:04 ` [Patch v2 03/19] CIFS: SMBD: " Long Li
                   ` (17 more replies)
  0 siblings, 18 replies; 53+ messages in thread
From: Long Li @ 2017-08-20 19:04 UTC (permalink / raw)
  To: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox
  Cc: Long Li

From: Long Li <longli@microsoft.com>

Starting with SMB2 dialect 3.0, Microsoft introduced the SMBDirect transport protocol for transferring upper layer (SMB2) payload over RDMA via InfiniBand, RoCE or iWARP. The protocol is published in [MS-SMBD] (https://msdn.microsoft.com/en-us/library/hh536346.aspx).

Patch v2 adds RDMA read/write via memory registration and addresses feedback on v1.

Long Li (19):
  CIFS: Add rdma mount option
  CIFS: SMBD: Add SMBDirect protocol and transport constants
  CIFS: SMBD: Implement SMBDirect transport
  CIFS: SMBD: Add SMBDirect transport to SMB connection and Makefile
  CIFS: SMBD: Connect to SMBDirect session
  CIFS: SMBD: Reconnect to SMBDirect session
  CIFS: SMBD: Destroy SMBDirect session on shutdown or umount
  CIFS: SMBD: Set SMBDirect maximum read or write size for I/O
  CIFS: SMBD: Read data from SMBDirect
  CIFS: SMBD: Send data through SMBDirect
  CIFS: SMBD: Define memory registration for I/O data
  CIFS: SMBD: Fix the definition for SMB2_CHANNEL_RDMA_V1_INVALIDATE
  CIFS: SMBD: Use registered memory RDMA read for SMB write
  CIFS: SMBD: Deregister memory when finishing SMB write
  CIFS: SMBD: Add parameter rdata to smb2_new_read_req
  CIFS: SMBD: Read correct returned data length for RDMA write (SMB
    READ) I/O
  CIFS: SMBD: Do not read from transport on registered memory RDMA write
    (SMB READ)
  CIFS: SMBD: Deregister memory when finishing SMB read
  CIFS: SMBD: Add SMBDirect debug counters

 fs/cifs/Makefile     |    2 +-
 fs/cifs/cifs_debug.c |   48 ++
 fs/cifs/cifsfs.c     |    2 +
 fs/cifs/cifsglob.h   |   17 +-
 fs/cifs/cifssmb.c    |    4 +-
 fs/cifs/connect.c    |   62 +-
 fs/cifs/file.c       |    5 +
 fs/cifs/smb1ops.c    |    2 +-
 fs/cifs/smb2ops.c    |   21 +-
 fs/cifs/smb2pdu.c    |  114 ++-
 fs/cifs/smb2pdu.h    |    2 +-
 fs/cifs/smbdirect.c  | 2328 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/cifs/smbdirect.h  |  300 +++++++
 fs/cifs/transport.c  |    7 +
 14 files changed, 2895 insertions(+), 19 deletions(-)
 create mode 100644 fs/cifs/smbdirect.c
 create mode 100644 fs/cifs/smbdirect.h

-- 
2.7.4


* [Patch v2 01/19] CIFS: Add RDMA mount option
@ 2017-08-20 19:04     ` Long Li
  0 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-08-20 19:04 UTC (permalink / raw)
  To: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox
  Cc: Long Li

From: Long Li <longli@microsoft.com>

Add "rdma" as a CIFS mount option, which tells CIFS to connect to the SMB server over SMBDirect. Add checks to validate that this feature is only used with SMB 3.x dialects.

To connect to SMBDirect, use "mount.cifs -o rdma,vers=3.x".

Signed-off-by: Long Li <longli@microsoft.com>
---
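
For readers skimming the series, the dialect check described above can be tried outside the kernel. The following is a minimal user-space sketch, not the kernel code itself: the helper name smbd_dialect_ok() is hypothetical, and the protocol ID values are assumed to match the dialect constants in fs/cifs/smb2pdu.h.

```c
#include <stdbool.h>

/* Assumed to mirror the dialect IDs defined in fs/cifs/smb2pdu.h */
#define SMB21_PROT_ID	0x0210
#define SMB30_PROT_ID	0x0300
#define SMB302_PROT_ID	0x0302
#define SMB311_PROT_ID	0x0311

/*
 * Hypothetical helper: returns true if the negotiated dialect may be
 * combined with the "rdma" mount option (SMB 3.0, 3.02 or 3.1.1 only).
 */
static bool smbd_dialect_ok(unsigned int protocol_id)
{
	switch (protocol_id) {
	case SMB30_PROT_ID:
	case SMB302_PROT_ID:
	case SMB311_PROT_ID:
		return true;
	default:
		return false;
	}
}
```

With this sketch, smbd_dialect_ok(SMB311_PROT_ID) is true and smbd_dialect_ok(SMB21_PROT_ID) is false, which corresponds to the mount being rejected for "rdma,vers=2.1".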
 fs/cifs/cifs_debug.c |  2 ++
 fs/cifs/cifsfs.c     |  2 ++
 fs/cifs/cifsglob.h   |  3 +++
 fs/cifs/connect.c    | 25 ++++++++++++++++++++++++-
 4 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/fs/cifs/cifs_debug.c b/fs/cifs/cifs_debug.c
index 9727e1d..ba0870d 100644
--- a/fs/cifs/cifs_debug.c
+++ b/fs/cifs/cifs_debug.c
@@ -171,6 +171,8 @@ static int cifs_debug_data_proc_show(struct seq_file *m, void *v)
 				ses->ses_count, ses->serverOS, ses->serverNOS,
 				ses->capabilities, ses->status);
 			}
+			if (server->rdma)
+				seq_printf(m, "RDMA\n\t");
 			seq_printf(m, "TCP status: %d\n\tLocal Users To "
 				   "Server: %d SecMode: 0x%x Req On Wire: %d",
 				   server->tcpStatus, server->srv_count,
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index fe0c8dc..a628800 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -330,6 +330,8 @@ cifs_show_address(struct seq_file *s, struct TCP_Server_Info *server)
 	default:
 		seq_puts(s, "(unknown)");
 	}
+	if (server->rdma)
+		seq_puts(s, ",rdma");
 }
 
 static void
diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index 8289f95..703c2fb 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -531,6 +531,7 @@ struct smb_vol {
 	bool nopersistent:1;
 	bool resilient:1; /* noresilient not required since not fored for CA */
 	bool domainauto:1;
+	bool rdma:1;
 	unsigned int rsize;
 	unsigned int wsize;
 	bool sockopt_tcp_nodelay:1;
@@ -649,6 +650,8 @@ struct TCP_Server_Info {
 	bool	sec_kerberos;		/* supports plain Kerberos */
 	bool	sec_mskerberos;		/* supports legacy MS Kerberos */
 	bool	large_buf;		/* is current buffer large? */
+	/* use SMBD connection instead of socket */
+	bool	rdma;
 	struct delayed_work	echo; /* echo ping workqueue job */
 	char	*smallbuf;	/* pointer to current "small" buffer */
 	char	*bigbuf;	/* pointer to current "big" buffer */
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 2eeaac6..d5d0ecd 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -94,7 +94,7 @@ enum {
 	Opt_multiuser, Opt_sloppy, Opt_nosharesock,
 	Opt_persistent, Opt_nopersistent,
 	Opt_resilient, Opt_noresilient,
-	Opt_domainauto,
+	Opt_domainauto, Opt_rdma,
 
 	/* Mount options which take numeric value */
 	Opt_backupuid, Opt_backupgid, Opt_uid,
@@ -185,6 +185,7 @@ static const match_table_t cifs_mount_option_tokens = {
 	{ Opt_resilient, "resilienthandles"},
 	{ Opt_noresilient, "noresilienthandles"},
 	{ Opt_domainauto, "domainauto"},
+	{ Opt_rdma, "rdma"},
 
 	{ Opt_backupuid, "backupuid=%s" },
 	{ Opt_backupgid, "backupgid=%s" },
@@ -1541,6 +1542,9 @@ cifs_parse_mount_options(const char *mountdata, const char *devname,
 		case Opt_domainauto:
 			vol->domainauto = true;
 			break;
+		case Opt_rdma:
+			vol->rdma = true;
+			break;
 
 		/* Numeric Values */
 		case Opt_backupuid:
@@ -1931,6 +1935,21 @@ cifs_parse_mount_options(const char *mountdata, const char *devname,
 		goto cifs_parse_mount_err;
 	}
 
+	if (vol->rdma) {
+		switch (vol->vals->protocol_id) {
+		case SMB30_PROT_ID:
+		case SMB302_PROT_ID:
+		case SMB311_PROT_ID:
+			break;
+		default:
+			cifs_dbg(
+				VFS,
+				"SMBDirect requires Version "
+				"3.0, 3.02 or 3.1.1\n");
+			goto cifs_parse_mount_err;
+		}
+	}
+
 #ifndef CONFIG_KEYS
 	/* Muliuser mounts require CONFIG_KEYS support */
 	if (vol->multiuser) {
@@ -2134,6 +2153,9 @@ static int match_server(struct TCP_Server_Info *server, struct smb_vol *vol)
 	if (server->echo_interval != vol->echo_interval * HZ)
 		return 0;
 
+	if (server->rdma != vol->rdma)
+		return 0;
+
 	return 1;
 }
 
@@ -2234,6 +2256,7 @@ cifs_get_tcp_session(struct smb_vol *volume_info)
 	tcp_ses->noblocksnd = volume_info->noblocksnd;
 	tcp_ses->noautotune = volume_info->noautotune;
 	tcp_ses->tcp_nodelay = volume_info->sockopt_tcp_nodelay;
+	tcp_ses->rdma = volume_info->rdma;
 	tcp_ses->in_flight = 0;
 	tcp_ses->credits = 1;
 	init_waitqueue_head(&tcp_ses->response_q);
-- 
2.7.4


* [Patch v2 02/19] CIFS: SMBD: Add SMBDirect protocol and transport constants
@ 2017-08-20 19:04     ` Long Li
  0 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-08-20 19:04 UTC (permalink / raw)
  To: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox
  Cc: Long Li

From: Long Li <longli@microsoft.com>

To prepare for the protocol implementation, add the constants and user-configurable values used by the protocol.

Signed-off-by: Long Li <longli@microsoft.com>
---
 fs/cifs/smbdirect.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/cifs/smbdirect.h | 20 ++++++++++++++
 2 files changed, 98 insertions(+)
 create mode 100644 fs/cifs/smbdirect.c
 create mode 100644 fs/cifs/smbdirect.h

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
new file mode 100644
index 0000000..d785bc1
--- /dev/null
+++ b/fs/cifs/smbdirect.c
@@ -0,0 +1,78 @@
+/*
+ *   Copyright (C) 2017, Microsoft Corporation.
+ *
+ *   Author(s): Long Li <longli@microsoft.com>
+ *
+ *   This program is free software;  you can redistribute it and/or modify
+ *   it under the terms of the GNU General Public License as published by
+ *   the Free Software Foundation; either version 2 of the License, or
+ *   (at your option) any later version.
+ *
+ *   This program is distributed in the hope that it will be useful,
+ *   but WITHOUT ANY WARRANTY;  without even the implied warranty of
+ *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See
+ *   the GNU General Public License for more details.
+ */
+#include <linux/module.h>
+#include "smbdirect.h"
+#include "cifs_debug.h"
+
+/* SMBD version number */
+#define SMBD_V1	0x0100
+
+/* Port numbers for SMBD transport */
+#define SMB_PORT	445
+#define SMBD_PORT	5445
+
+/* Address lookup and resolve timeout in ms */
+#define RDMA_RESOLVE_TIMEOUT	5000
+
+/* SMBD negotiation timeout in seconds */
+#define SMBD_NEGOTIATE_TIMEOUT	120
+
+/* SMBD minimum receive size and fragmented size defined in [MS-SMBD] */
+#define SMBD_MIN_RECEIVE_SIZE		128
+#define SMBD_MIN_FRAGMENTED_SIZE	131072
+
+/*
+ * Default maximum number of RDMA read/write outstanding on this connection
+ * This value is possibly decreased during QP creation on hardware limit
+ */
+#define SMBD_CM_RESPONDER_RESOURCES	32
+
+/* Maximum number of retries on data transfer operations */
+#define SMBD_CM_RETRY			6
+/* No need to retry on Receiver Not Ready since SMBD manages credits */
+#define SMBD_CM_RNR_RETRY		0
+
+/*
+ * User configurable initial values per SMBD transport connection
+ * as defined in [MS-SMBD] 3.1.1.1
+ * Those may change after a SMBD negotiation
+ */
+/* The local peer's maximum number of credits to grant to the peer */
+static int receive_credit_max = 255;
+/* The remote peer's credit request of local peer */
+static int send_credit_target = 255;
+/* The maximum single-message size that can be sent to the remote peer */
+static int max_send_size = 1364;
+/*  The maximum fragmented upper-layer payload receive size supported */
+static int max_fragmented_recv_size = 1024 * 1024;
+/*  The maximum single-message size which can be received */
+static int max_receive_size = 8192;
+
+/* The timeout to initiate send of a keepalive message on idle */
+static int keep_alive_interval = 120;
+
+/*
+ * User configurable initial values for RDMA transport
+ * The actual values used may be lower and are limited to hardware capabilities
+ */
+/* Default maximum number of SGEs in a RDMA send/recv */
+static int max_send_sge = SMBDIRECT_MAX_SGE;
+static int max_recv_sge = SMBDIRECT_MAX_SGE;
+/* Default maximum number of SGEs in a RDMA write/read */
+static int max_frmr_depth = 2048;
+
+/* If payload is less than this byte, use RDMA send/recv not read/write */
+static int rdma_readwrite_threshold = 4096;
diff --git a/fs/cifs/smbdirect.h b/fs/cifs/smbdirect.h
new file mode 100644
index 0000000..06eeb0b
--- /dev/null
+++ b/fs/cifs/smbdirect.h
@@ -0,0 +1,20 @@
+/*
+ *   Copyright (C) 2017, Microsoft Corporation.
+ *
+ *   Author(s): Long Li <longli@microsoft.com>
+ *
+ *   This program is free software;  you can redistribute it and/or modify
+ *   it under the terms of the GNU General Public License as published by
+ *   the Free Software Foundation; either version 2 of the License, or
+ *   (at your option) any later version.
+ *
+ *   This program is distributed in the hope that it will be useful,
+ *   but WITHOUT ANY WARRANTY;  without even the implied warranty of
+ *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See
+ *   the GNU General Public License for more details.
+ */
+#ifndef _SMBDIRECT_H
+#define _SMBDIRECT_H
+
+#define SMBDIRECT_MAX_SGE	16
+#endif
-- 
2.7.4
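
The comment on the RDMA transport values in this patch notes that the configured numbers are upper bounds which the hardware may lower. A minimal sketch of that clamping in plain C follows. It is an illustration under assumptions: struct device_limits and clamp_to_hw() are hypothetical stand-ins for whatever the RDMA device reports, and neither appears in this patch.

```c
#define SMBDIRECT_MAX_SGE	16

/* Hypothetical: limits reported by the RDMA device at connect time */
struct device_limits {
	int max_sge;		/* SGEs per send/recv work request */
	int max_mr_pages;	/* pages per registered memory region */
};

/* Use the configured value unless the hardware limit is lower */
static int clamp_to_hw(int configured, int hw_limit)
{
	return configured < hw_limit ? configured : hw_limit;
}

/* Hypothetical: effective number of send SGEs for one connection */
static int effective_send_sge(const struct device_limits *hw)
{
	return clamp_to_hw(SMBDIRECT_MAX_SGE, hw->max_sge);
}
```

For a device that reports only 4 SGEs, effective_send_sge() yields 4; for one that reports 32, the configured default of 16 is kept.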


* [Patch v2 03/19] CIFS: SMBD: Implement SMBDirect
  2017-08-20 19:04 [Patch v2 00/19] CIFS: Implement SMBDirect Long Li
@ 2017-08-20 19:04 ` Long Li
  2017-08-20 19:04 ` [Patch v2 04/19] CIFS: SMBD: Add SMBDirect transport to SMB connection and Makefile Long Li
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-08-20 19:04 UTC (permalink / raw)
  To: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox
  Cc: Long Li

From: Long Li <longli@microsoft.com>

Add code to implement SMBDirect transport and protocol.

1. Add APIs in the header file. Upper layer code uses the transport through these APIs.
2. Define the SMBDirect connection in the header file. A connection is based on an RC QP in RDMA.
3. The implementation doesn't maintain send buffers or a send queue for transferring payload via RDMA send. There is no data copy in the transport on send.
4. On the receive path, the implementation maintains receive buffers and a reassembly queue for transferring payload via RDMA recv. There is a data copy in the transport on recv.
5. The implementation recognizes the RFC1002 header length used in the SMB upper layer payloads in CIFS. Because this length is mainly used for TCP and is not applicable to RDMA, it is handled as out-of-band information that is never sent over the wire, and the transport behaves like TCP to the upper layer by processing and exposing the length correctly on data payloads.
6. The SMBDirect protocol enforces credits on RDMA send and recv. Credits are exchanged and mutually managed by the SMB server and client.
7. Each connection defines a user-configurable rdma_readwrite_threshold. Upper layer payloads larger than rdma_readwrite_threshold are sent through RDMA read and received via RDMA write. There is a fixed number of registered memory regions per connection for doing RDMA read/write. There is no data copy in the transport on RDMA read/write.
8. RDMA notification calls on CQ completions can run in either workqueue or softirq context. Benchmarks show no visible difference between the two. This implementation chooses the workqueue (IB_POLL_WORKQUEUE), which also avoids spin_lock_irqsave throughout the code (spin_lock is used instead).

Signed-off-by: Long Li <longli@microsoft.com>
---
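
Item 5 refers to the 4-byte header that CIFS prepends to each SMB payload on TCP. As a rough illustration of what recognizing that length involves, the sketch below decodes the length field from such a header in user-space C. The helper name rfc1002_length() is hypothetical and is not taken from this patch; it assumes the layout used for SMB over direct TCP, one type byte followed by a 24-bit big-endian length.

```c
#include <stdint.h>

/*
 * Hypothetical helper: extract the payload length from a 4-byte
 * RFC1002-style header. Byte 0 is the message type and is ignored;
 * bytes 1..3 hold the length in big-endian order (assumed layout).
 */
static uint32_t rfc1002_length(const uint8_t hdr[4])
{
	return ((uint32_t)hdr[1] << 16) |
	       ((uint32_t)hdr[2] << 8) |
	       (uint32_t)hdr[3];
}
```

On an RDMA transport this value would be consumed locally to size the upper layer payload rather than transmitted, which is the out-of-band handling the item describes.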
 fs/cifs/smbdirect.c | 2250 +++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/cifs/smbdirect.h |  280 +++++++
 2 files changed, 2530 insertions(+)
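
The choice described in item 7 of the commit message can be pictured as a simple size comparison. This is a sketch only: the enum and the function smbd_pick_path() are hypothetical names, and treating a payload exactly equal to the threshold as a send/recv case is an assumption, since the description only says that larger payloads use RDMA read/write.

```c
#include <stddef.h>

/* Hypothetical labels for the two data paths */
enum smbd_path {
	SMBD_PATH_SEND_RECV,		/* payload copied via RDMA send/recv */
	SMBD_PATH_RDMA_READ_WRITE	/* payload moved via registered memory */
};

/*
 * Hypothetical helper: payloads larger than the threshold go through
 * RDMA read/write, everything else through RDMA send/recv.
 */
static enum smbd_path smbd_pick_path(size_t payload_len, size_t threshold)
{
	return payload_len > threshold ?
		SMBD_PATH_RDMA_READ_WRITE : SMBD_PATH_SEND_RECV;
}
```

With the default threshold of 4096 bytes from the constants patch, a 64 KiB write would take the registered-memory path while a small metadata request would stay on send/recv.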

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index d785bc1..01bf418 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -17,6 +17,35 @@
 #include "smbdirect.h"
 #include "cifs_debug.h"
 
+static struct smbd_response *get_receive_buffer(
+		struct smbd_connection *info);
+static void put_receive_buffer(
+		struct smbd_connection *info,
+		struct smbd_response *response);
+static int allocate_receive_buffers(struct smbd_connection *info, int num_buf);
+static void destroy_receive_buffers(struct smbd_connection *info);
+
+static void enqueue_reassembly(
+		struct smbd_connection *info,
+		struct smbd_response *response, int data_length);
+static struct smbd_response *_get_first_reassembly(
+		struct smbd_connection *info);
+
+static int smbd_post_recv(
+		struct smbd_connection *info,
+		struct smbd_response *response);
+
+static int smbd_post_send_empty(struct smbd_connection *info);
+static int smbd_post_send_data(
+		struct smbd_connection *info,
+		struct kvec *iov, int n_vec, int remaining_data_length);
+static int smbd_post_send_page(struct smbd_connection *info,
+		struct page *page, unsigned long offset,
+		size_t size, int remaining_data_length);
+
+static void destroy_mr_list(struct smbd_connection *info);
+static int allocate_mr_list(struct smbd_connection *info);
+
 /* SMBD version number */
 #define SMBD_V1	0x0100
 
@@ -76,3 +105,2224 @@ static int max_frmr_depth = 2048;
 
 /* If payload is less than this byte, use RDMA send/recv not read/write */
 static int rdma_readwrite_threshold = 4096;
+
+/* Transport logging functions
+ * Logging are defined as classes. They can be OR'ed to define the actual
+ * logging level via module parameter smbd_logging_class
+ * e.g. cifs.smbd_logging_class=0x500 will log all log_rdma_recv() and
+ * log_rdma_event()
+ */
+#define LOG_CREDIT			0x1
+#define LOG_OUTGOING			0x2
+#define LOG_INCOMING			0x4
+#define LOG_RECEIVE_QUEUE		0x8
+#define LOG_REASSEMBLY_QUEUE		0x10
+#define LOG_READ			0x20
+#define LOG_WRITE			0x40
+#define LOG_RDMA_SEND			0x80
+#define LOG_RDMA_RECV			0x100
+#define LOG_KEEP_ALIVE			0x200
+#define LOG_RDMA_EVENT			0x400
+#define LOG_RDMA_MR			0x800
+
+static unsigned int smbd_logging_class = LOG_RDMA_MR;
+module_param(smbd_logging_class, uint, 0644);
+MODULE_PARM_DESC(smbd_logging_class,
+	"Logging class for SMBD transport 0x0 to 0xfff");
+
+#define log_rdma(class, fmt, args...)					\
+do {									\
+	if (class & smbd_logging_class)					\
+		cifs_dbg(VFS, "%s:%d " fmt, __func__, __LINE__, ##args);\
+} while (0)
+
+#define log_rdma_credit(fmt, args...)	log_rdma(LOG_CREDIT, fmt, ##args)
+#define log_outgoing(fmt, args...)	log_rdma(LOG_OUTGOING, fmt, ##args)
+#define log_incoming(fmt, args...)	log_rdma(LOG_INCOMING, fmt, ##args)
+#define log_receive_queue(fmt, args...)		\
+	log_rdma(LOG_RECEIVE_QUEUE, fmt, ##args)
+#define log_reassembly_queue(fmt, args...)	\
+		log_rdma(LOG_REASSEMBLY_QUEUE, fmt, ##args)
+#define log_read(fmt, args...)	log_rdma(LOG_READ, fmt, ##args)
+#define log_write(fmt, args...)	log_rdma(LOG_WRITE, fmt, ##args)
+#define log_rdma_send(fmt, args...)	log_rdma(LOG_RDMA_SEND, fmt, ##args)
+#define log_rdma_recv(fmt, args...)	log_rdma(LOG_RDMA_RECV, fmt, ##args)
+#define log_keep_alive(fmt, args...)	log_rdma(LOG_KEEP_ALIVE, fmt, ##args)
+#define log_rdma_event(fmt, args...)	log_rdma(LOG_RDMA_EVENT, fmt, ##args)
+#define log_rdma_mr(fmt, args...)	log_rdma(LOG_RDMA_MR, fmt, ##args)
+
+#define log_transport_credit(info)					\
+	log_rdma_credit("receive_credits %d receive_credit_target %d "	\
+			"send_credits %d send_credit_target %d\n",	\
+			atomic_read(&info->receive_credits),		\
+			info->receive_credit_target,			\
+			atomic_read(&info->send_credits),		\
+			info->send_credit_target)			\
+
+/*
+ * Destroy the transport and related RDMA and memory resources
+ * Need to go through all the pending counters and make sure no one is using
+ * the transport while it is destroyed
+ */
+static void smbd_destroy_rdma_work(struct work_struct *work)
+{
+	struct smbd_response *response;
+	struct smbd_connection *info =
+		container_of(work, struct smbd_connection, destroy_work);
+
+	log_rdma_event("cancelling all pending works\n");
+	cancel_delayed_work_sync(&info->idle_timer_work);
+	cancel_delayed_work_sync(&info->send_immediate_work);
+
+	ib_drain_qp(info->id->qp);
+	rdma_destroy_qp(info->id);
+
+	log_rdma_event("wait for all send or recv finish\n");
+	wait_event(info->wait_send_pending,
+		atomic_read(&info->send_pending) == 0);
+	wait_event(info->wait_send_payload_pending,
+		atomic_read(&info->send_payload_pending) == 0);
+	wait_event(info->wait_recv_pending,
+		atomic_read(&info->recv_pending) == 0);
+	wait_event(info->wait_read_pending,
+		atomic_read(&info->read_pending) == 0);
+	destroy_mr_list(info);
+
+	log_rdma_event("drain the reassembly queue\n");
+	do {
+		spin_lock(&info->reassembly_queue_lock);
+		response = _get_first_reassembly(info);
+		if (response) {
+			list_del(&response->list);
+			spin_unlock(&info->reassembly_queue_lock);
+			put_receive_buffer(info, response);
+		}
+	} while (response);
+	spin_unlock(&info->reassembly_queue_lock);
+	info->reassembly_data_length = 0;
+	wake_up_interruptible(&info->wait_reassembly_queue);
+
+	log_rdma_event("free receive buffers\n");
+	destroy_receive_buffers(info);
+
+	ib_free_cq(info->send_cq);
+	ib_free_cq(info->recv_cq);
+	ib_dealloc_pd(info->pd);
+	rdma_destroy_id(info->id);
+
+	/* free mempools */
+	mempool_destroy(info->request_mempool);
+	kmem_cache_destroy(info->request_cache);
+
+	mempool_destroy(info->response_mempool);
+	kmem_cache_destroy(info->response_cache);
+
+	info->transport_status = SMBD_DESTROYED;
+	wake_up_all(&info->wait_destroy);
+}
+
+static int smbd_process_disconnected(struct smbd_connection *info)
+{
+	schedule_work(&info->destroy_work);
+	return 0;
+}
+
+static void smbd_disconnect_rdma_work(struct work_struct *work)
+{
+	struct smbd_connection *info =
+		container_of(work, struct smbd_connection, disconnect_work);
+
+	if (info->transport_status == SMBD_CONNECTED) {
+		info->transport_status = SMBD_DISCONNECTING;
+		rdma_disconnect(info->id);
+	}
+}
+
+static void smbd_disconnect_rdma_connection(struct smbd_connection *info)
+{
+	schedule_work(&info->disconnect_work);
+}
+
+/* Upcall from RDMA CM */
+static int smbd_conn_upcall(
+		struct rdma_cm_id *id, struct rdma_cm_event *event)
+{
+	struct smbd_connection *info = id->context;
+
+	log_rdma_event("event=%d status=%d\n", event->event, event->status);
+
+	switch (event->event) {
+	case RDMA_CM_EVENT_ADDR_RESOLVED:
+	case RDMA_CM_EVENT_ROUTE_RESOLVED:
+		info->ri_rc = 0;
+		complete(&info->ri_done);
+		break;
+
+	case RDMA_CM_EVENT_ADDR_ERROR:
+		info->ri_rc = -EHOSTUNREACH;
+		complete(&info->ri_done);
+		break;
+
+	case RDMA_CM_EVENT_ROUTE_ERROR:
+		info->ri_rc = -ENETUNREACH;
+		complete(&info->ri_done);
+		break;
+
+	case RDMA_CM_EVENT_ESTABLISHED:
+	case RDMA_CM_EVENT_CONNECT_ERROR:
+	case RDMA_CM_EVENT_UNREACHABLE:
+	case RDMA_CM_EVENT_REJECTED:
+	case RDMA_CM_EVENT_DEVICE_REMOVAL:
+		log_rdma_event("connected event=%d\n", event->event);
+		info->connect_state = event->event;
+		info->transport_status = SMBD_CONNECTED;
+		wake_up_interruptible(&info->conn_wait);
+		break;
+
+	case RDMA_CM_EVENT_DISCONNECTED:
+		info->transport_status = SMBD_DISCONNECTED;
+		smbd_process_disconnected(info);
+		break;
+
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+/* Upcall from RDMA QP */
+static void
+smbd_qp_async_error_upcall(struct ib_event *event, void *context)
+{
+	struct smbd_connection *info = context;
+
+	log_rdma_event("%s on device %s info %p\n",
+		ib_event_msg(event->event), event->device->name, info);
+
+	switch (event->event) {
+	case IB_EVENT_CQ_ERR:
+	case IB_EVENT_QP_FATAL:
+		smbd_disconnect_rdma_connection(info);
+		break;
+
+	default:
+		break;
+	}
+}
+
+static inline void *smbd_request_payload(struct smbd_request *request)
+{
+	return (void *)request->packet;
+}
+
+static inline void *smbd_response_payload(struct smbd_response *response)
+{
+	return (void *)response->packet;
+}
+
+/* Called when an RDMA send is done */
+static void send_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	int i;
+	struct smbd_request *request =
+		container_of(wc->wr_cqe, struct smbd_request, cqe);
+
+	log_rdma_send("smbd_request %p completed wc->status=%d\n",
+		request, wc->status);
+
+	if (wc->status != IB_WC_SUCCESS || wc->opcode != IB_WC_SEND) {
+		log_rdma_send("wc->status=%d wc->opcode=%d\n",
+			wc->status, wc->opcode);
+	}
+
+	for (i = 0; i < request->num_sge; i++)
+		ib_dma_unmap_single(request->info->id->device,
+			request->sge[i].addr,
+			request->sge[i].length,
+			DMA_TO_DEVICE);
+
+	if (request->has_payload) {
+		if (atomic_dec_and_test(&request->info->send_payload_pending))
+			wake_up(&request->info->wait_send_payload_pending);
+	} else {
+		if (atomic_dec_and_test(&request->info->send_pending))
+			wake_up(&request->info->wait_send_pending);
+	}
+
+	mempool_free(request, request->info->request_mempool);
+}
+
+static void dump_smbd_negotiate_resp(struct smbd_negotiate_resp *resp)
+{
+	log_rdma_event("resp message min_version %u max_version %u "
+		"negotiated_version %u credits_requested %u "
+		"credits_granted %u status %u max_readwrite_size %u "
+		"preferred_send_size %u max_receive_size %u "
+		"max_fragmented_size %u\n",
+		resp->min_version, resp->max_version, resp->negotiated_version,
+		resp->credits_requested, resp->credits_granted, resp->status,
+		resp->max_readwrite_size, resp->preferred_send_size,
+		resp->max_receive_size, resp->max_fragmented_size);
+}
+
+/*
+ * Process a negotiation response message, according to [MS-SMBD] 3.1.5.7
+ * response, packet_length: the negotiation response message
+ * return value: true if negotiation is a success, false if failed
+ */
+static bool process_negotiation_response(
+		struct smbd_response *response, int packet_length)
+{
+	struct smbd_connection *info = response->info;
+	struct smbd_negotiate_resp *packet = smbd_response_payload(response);
+
+	if (packet_length < sizeof(struct smbd_negotiate_resp)) {
+		log_rdma_event("error: packet_length=%d\n", packet_length);
+		return false;
+	}
+
+	if (le16_to_cpu(packet->negotiated_version) != SMBD_V1) {
+		log_rdma_event("error: negotiated_version=%x\n",
+			le16_to_cpu(packet->negotiated_version));
+		return false;
+	}
+	info->protocol = le16_to_cpu(packet->negotiated_version);
+
+	if (packet->credits_requested == 0) {
+		log_rdma_event("error: credits_requested==0\n");
+		return false;
+	}
+	info->receive_credit_target = le16_to_cpu(packet->credits_requested);
+
+	if (packet->credits_granted == 0) {
+		log_rdma_event("error: credits_granted==0\n");
+		return false;
+	}
+	atomic_set(&info->send_credits, le16_to_cpu(packet->credits_granted));
+
+	atomic_set(&info->receive_credits, 0);
+
+	if (le32_to_cpu(packet->preferred_send_size) > info->max_receive_size) {
+		log_rdma_event("error: preferred_send_size=%d\n",
+			le32_to_cpu(packet->preferred_send_size));
+		return false;
+	}
+	info->max_receive_size = le32_to_cpu(packet->preferred_send_size);
+
+	if (le32_to_cpu(packet->max_receive_size) < SMBD_MIN_RECEIVE_SIZE) {
+		log_rdma_event("error: max_receive_size=%d\n",
+			le32_to_cpu(packet->max_receive_size));
+		return false;
+	}
+	info->max_send_size = min_t(int, info->max_send_size,
+					le32_to_cpu(packet->max_receive_size));
+
+	if (le32_to_cpu(packet->max_fragmented_size) <
+			SMBD_MIN_FRAGMENTED_SIZE) {
+		log_rdma_event("error: max_fragmented_size=%d\n",
+			le32_to_cpu(packet->max_fragmented_size));
+		return false;
+	}
+	info->max_fragmented_send_size =
+		le32_to_cpu(packet->max_fragmented_size);
+	info->rdma_readwrite_threshold =
+		rdma_readwrite_threshold > info->max_fragmented_send_size ?
+		info->max_fragmented_send_size :
+		rdma_readwrite_threshold;
+
+
+	info->max_readwrite_size = min_t(u32,
+			le32_to_cpu(packet->max_readwrite_size),
+			info->max_frmr_depth * PAGE_SIZE);
+	info->max_frmr_depth = info->max_readwrite_size / PAGE_SIZE;
+
+	return true;
+}
+
+/*
+ * Check and schedule to send an immediate packet
+ * This is used to extend credits to remote peer to keep the transport busy
+ */
+static void check_and_send_immediate(struct smbd_connection *info)
+{
+	if (info->transport_status != SMBD_CONNECTED)
+		return;
+
+	info->send_immediate = true;
+
+	/*
+	 * Promptly send a packet if our peer is running low on receive
+	 * credits
+	 */
+	if (atomic_read(&info->receive_credits) <
+		info->receive_credit_target - 1)
+		queue_delayed_work(
+			info->workqueue, &info->send_immediate_work, 0);
+}
+
+/* Called from softirq, when recv is done */
+static void recv_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct smbd_data_transfer *data_transfer;
+	struct smbd_response *response =
+		container_of(wc->wr_cqe, struct smbd_response, cqe);
+	struct smbd_connection *info = response->info;
+
+	log_rdma_recv("response=%p type=%d wc status=%d wc opcode %d "
+		      "byte_len=%d pkey_index=%x\n",
+		response, response->type, wc->status, wc->opcode,
+		wc->byte_len, wc->pkey_index);
+
+	if (wc->status != IB_WC_SUCCESS || wc->opcode != IB_WC_RECV) {
+		log_rdma_recv("wc->status=%d opcode=%d\n",
+			wc->status, wc->opcode);
+		goto error;
+	}
+
+	ib_dma_sync_single_for_cpu(
+		wc->qp->device,
+		response->sge.addr,
+		response->sge.length,
+		DMA_FROM_DEVICE);
+
+	switch (response->type) {
+	/* SMBD negotiation response */
+	case SMBD_NEGOTIATE_RESP:
+		dump_smbd_negotiate_resp(smbd_response_payload(response));
+		info->full_packet_received = true;
+		info->negotiate_done =
+			process_negotiation_response(response, wc->byte_len);
+		complete(&info->negotiate_completion);
+		break;
+
+	/* SMBD data transfer packet */
+	case SMBD_TRANSFER_DATA:
+		data_transfer = smbd_response_payload(response);
+		atomic_dec(&info->receive_credits);
+		info->receive_credit_target =
+			le16_to_cpu(data_transfer->credits_requested);
+		atomic_add(le16_to_cpu(data_transfer->credits_granted),
+			&info->send_credits);
+
+		log_incoming("data flags %d data_offset %d data_length %d "
+			     "remaining_data_length %d\n",
+			le16_to_cpu(data_transfer->flags),
+			le32_to_cpu(data_transfer->data_offset),
+			le32_to_cpu(data_transfer->data_length),
+			le32_to_cpu(data_transfer->remaining_data_length));
+
+		log_transport_credit(info);
+
+		/*
+		 * We may have new send credits granted from remote peer
+		 * If any sender is blocked on lack of credits, unblock it
+		 */
+		if (atomic_read(&info->send_credits))
+			wake_up_interruptible(&info->wait_send_queue);
+
+		/* Send a KEEP_ALIVE response right away if requested */
+		info->keep_alive_requested = KEEP_ALIVE_NONE;
+		if (le16_to_cpu(data_transfer->flags) &
+				SMB_DIRECT_RESPONSE_REQUESTED) {
+			info->keep_alive_requested = KEEP_ALIVE_PENDING;
+		}
+
+		/*
+		 * Check if we need to send something to remote peer to
+		 * grant more credits or respond to KEEP_ALIVE packet
+		 */
+		check_and_send_immediate(info);
+
+		/*
+		 * If this is a packet with data payload, place the data in
+		 * reassembly queue and wake up the reading thread
+		 */
+		if (le32_to_cpu(data_transfer->data_length)) {
+			if (info->full_packet_received)
+				response->first_segment = true;
+
+			if (le32_to_cpu(data_transfer->remaining_data_length))
+				info->full_packet_received = false;
+			else
+				info->full_packet_received = true;
+
+			enqueue_reassembly(
+				info,
+				response,
+				le32_to_cpu(data_transfer->data_length));
+
+			wake_up_interruptible(&info->wait_reassembly_queue);
+			goto queue_done;
+		}
+
+		/* This is an empty packet, finish it */
+		break;
+
+	default:
+		log_rdma_recv("unexpected response type=%d\n", response->type);
+	}
+
+error:
+	put_receive_buffer(info, response);
+
+queue_done:
+	if (atomic_dec_and_test(&info->recv_pending))
+		wake_up(&info->wait_recv_pending);
+}
+
+static struct rdma_cm_id *smbd_create_id(
+		struct smbd_connection *info,
+		struct sockaddr *dstaddr, int port)
+{
+	struct rdma_cm_id *id;
+	int rc;
+	__be16 *sport;
+
+	id = rdma_create_id(&init_net, smbd_conn_upcall, info,
+		RDMA_PS_TCP, IB_QPT_RC);
+	if (IS_ERR(id)) {
+		rc = PTR_ERR(id);
+		log_rdma_event("rdma_create_id() failed %i\n", rc);
+		return id;
+	}
+
+	if (dstaddr->sa_family == AF_INET6)
+		sport = &((struct sockaddr_in6 *)dstaddr)->sin6_port;
+	else
+		sport = &((struct sockaddr_in *)dstaddr)->sin_port;
+
+	*sport = htons(port);
+
+	init_completion(&info->ri_done);
+	info->ri_rc = -ETIMEDOUT;
+
+	rc = rdma_resolve_addr(id, NULL, (struct sockaddr *)dstaddr,
+		RDMA_RESOLVE_TIMEOUT);
+	if (rc) {
+		log_rdma_event("rdma_resolve_addr() failed %i\n", rc);
+		goto out;
+	}
+	wait_for_completion_interruptible_timeout(
+		&info->ri_done, msecs_to_jiffies(RDMA_RESOLVE_TIMEOUT));
+	rc = info->ri_rc;
+	if (rc) {
+		log_rdma_event("rdma_resolve_addr() completed %i\n", rc);
+		goto out;
+	}
+
+	info->ri_rc = -ETIMEDOUT;
+	rc = rdma_resolve_route(id, RDMA_RESOLVE_TIMEOUT);
+	if (rc) {
+		log_rdma_event("rdma_resolve_route() failed %i\n", rc);
+		goto out;
+	}
+	wait_for_completion_interruptible_timeout(
+		&info->ri_done, msecs_to_jiffies(RDMA_RESOLVE_TIMEOUT));
+	rc = info->ri_rc;
+	if (rc) {
+		log_rdma_event("rdma_resolve_route() completed %i\n", rc);
+		goto out;
+	}
+
+	return id;
+
+out:
+	rdma_destroy_id(id);
+	return ERR_PTR(rc);
+}
+
+/*
+ * Test if FRWR (Fast Registration Work Requests) is supported on the device
+ * This implementation requires FRWR on RDMA read/write
+ * return value: true if it is supported
+ */
+static bool frwr_is_supported(struct ib_device_attr *attrs)
+{
+	if (!(attrs->device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS))
+		return false;
+	if (attrs->max_fast_reg_page_list_len == 0)
+		return false;
+	return true;
+}
+
+static int smbd_ia_open(
+		struct smbd_connection *info,
+		struct sockaddr *dstaddr, int port)
+{
+	int rc;
+
+	info->id = smbd_create_id(info, dstaddr, port);
+	if (IS_ERR(info->id)) {
+		rc = PTR_ERR(info->id);
+		goto out1;
+	}
+
+	if (!frwr_is_supported(&info->id->device->attrs)) {
+		log_rdma_event(
+			"Fast Registration Work Requests "
+			"(FRWR) is not supported\n");
+		log_rdma_event(
+			"Device capability flags = %llx "
+			"max_fast_reg_page_list_len = %u\n",
+			info->id->device->attrs.device_cap_flags,
+			info->id->device->attrs.max_fast_reg_page_list_len);
+		rc = -EPROTONOSUPPORT;
+		goto out2;
+	}
+	info->max_frmr_depth = min_t(int,
+		max_frmr_depth,
+		info->id->device->attrs.max_fast_reg_page_list_len);
+	info->mr_type = IB_MR_TYPE_MEM_REG;
+	if (info->id->device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG)
+		info->mr_type = IB_MR_TYPE_SG_GAPS;
+
+	info->pd = ib_alloc_pd(info->id->device, 0);
+	if (IS_ERR(info->pd)) {
+		rc = PTR_ERR(info->pd);
+		log_rdma_event("ib_alloc_pd() returned %d\n", rc);
+		goto out2;
+	}
+
+	return 0;
+
+out2:
+	rdma_destroy_id(info->id);
+	info->id = NULL;
+
+out1:
+	return rc;
+}
+
+/*
+ * Send a negotiation request message to the peer
+ * The negotiation procedure is in [MS-SMBD] 3.1.5.2 and 3.1.5.3
+ * After negotiation, the transport is connected and ready for
+ * carrying upper layer SMB payload
+ */
+static int smbd_post_send_negotiate_req(struct smbd_connection *info)
+{
+	struct ib_send_wr send_wr, *send_wr_fail;
+	int rc = -ENOMEM;
+	struct smbd_request *request;
+	struct smbd_negotiate_req *packet;
+
+	request = mempool_alloc(info->request_mempool, GFP_KERNEL);
+	if (!request)
+		return rc;
+
+	request->info = info;
+
+	packet = smbd_request_payload(request);
+	packet->min_version = cpu_to_le16(SMBD_V1);
+	packet->max_version = cpu_to_le16(SMBD_V1);
+	packet->reserved = 0;
+	packet->credits_requested = cpu_to_le16(info->send_credit_target);
+	packet->preferred_send_size = cpu_to_le32(info->max_send_size);
+	packet->max_receive_size = cpu_to_le32(info->max_receive_size);
+	packet->max_fragmented_size =
+		cpu_to_le32(info->max_fragmented_recv_size);
+
+	request->num_sge = 1;
+	request->sge[0].addr = ib_dma_map_single(
+				info->id->device, (void *)packet,
+				sizeof(*packet), DMA_TO_DEVICE);
+	if (ib_dma_mapping_error(info->id->device, request->sge[0].addr)) {
+		rc = -EIO;
+		goto dma_mapping_failed;
+	}
+
+	request->sge[0].length = sizeof(*packet);
+	request->sge[0].lkey = info->pd->local_dma_lkey;
+
+	ib_dma_sync_single_for_device(
+		info->id->device, request->sge[0].addr,
+		request->sge[0].length, DMA_TO_DEVICE);
+
+	request->cqe.done = send_done;
+
+	send_wr.next = NULL;
+	send_wr.wr_cqe = &request->cqe;
+	send_wr.sg_list = request->sge;
+	send_wr.num_sge = request->num_sge;
+	send_wr.opcode = IB_WR_SEND;
+	send_wr.send_flags = IB_SEND_SIGNALED;
+
+	log_rdma_send("sge addr=%llx length=%x lkey=%x\n",
+		request->sge[0].addr,
+		request->sge[0].length, request->sge[0].lkey);
+
+	request->has_payload = false;
+	atomic_inc(&info->send_pending);
+	rc = ib_post_send(info->id->qp, &send_wr, &send_wr_fail);
+	if (!rc)
+		return 0;
+
+	/* if we reach here, post send failed */
+	log_rdma_send("ib_post_send failed rc=%d\n", rc);
+	atomic_dec(&info->send_pending);
+	ib_dma_unmap_single(info->id->device, request->sge[0].addr,
+		request->sge[0].length, DMA_TO_DEVICE);
+
+dma_mapping_failed:
+	mempool_free(request, info->request_mempool);
+	return rc;
+}
+
+/*
+ * Extend the credits to remote peer
+ * This implements [MS-SMBD] 3.1.5.9
+ * The idea is that we should extend credits to remote peer as quickly as
+ * it's allowed, to maintain data flow. We allocate as much receive
+ * buffer as possible, and extend the receive credits to remote peer
+ * return value: the new credits being granted.
+ */
+static int manage_credits_prior_sending(struct smbd_connection *info)
+{
+	int ret = 0;
+	struct smbd_response *response;
+	int rc;
+
+	if (info->receive_credit_target > atomic_read(&info->receive_credits)) {
+		while (true) {
+			response = get_receive_buffer(info);
+			if (!response)
+				break;
+
+			response->type = SMBD_TRANSFER_DATA;
+			response->first_segment = false;
+			rc = smbd_post_recv(info, response);
+			if (rc) {
+				log_rdma_recv("post_recv failed rc=%d\n", rc);
+				put_receive_buffer(info, response);
+				break;
+			}
+
+			ret++;
+		}
+	}
+
+	atomic_add(ret, &info->receive_credits);
+	log_transport_credit(info);
+
+	if (ret)
+		info->send_immediate = false;
+
+	return ret;
+}
+
+/*
+ * Check if we need to send a KEEP_ALIVE message
+ * The idle connection timer triggers a KEEP_ALIVE message when it expires
+ * SMB_DIRECT_RESPONSE_REQUESTED is set in the message flag to have peer send
+ * back a response.
+ * return value:
+ * 1 if SMB_DIRECT_RESPONSE_REQUESTED needs to be set
+ * 0 otherwise
+ */
+static int manage_keep_alive_before_sending(struct smbd_connection *info)
+{
+	if (info->keep_alive_requested == KEEP_ALIVE_PENDING) {
+		info->keep_alive_requested = KEEP_ALIVE_SENT;
+		return 1;
+	}
+	return 0;
+}
+
+/*
+ * Build and prepare the SMBD packet header
+ * This function waits for available send credits and builds an SMBD packet
+ * header. The caller may then optionally append payload to the packet after
+ * the header
+ * input values
+ * size: the size of the payload
+ * remaining_data_length: remaining data to send if this is part of a
+ * fragmented packet
+ * output values
+ * request_out: the request allocated from this function
+ * return values: 0 on success, otherwise actual error code returned
+ */
+static int smbd_create_header(struct smbd_connection *info,
+		int size, int remaining_data_length,
+		struct smbd_request **request_out)
+{
+	struct smbd_request *request;
+	struct smbd_data_transfer *packet;
+	int header_length;
+	int rc;
+
+	if (info->transport_status != SMBD_CONNECTED) {
+		log_outgoing("disconnected not sending\n");
+		return -ENOENT;
+	}
+
+	/* Wait for send credits. A SMBD packet needs one credit */
+	rc = wait_event_interruptible(info->wait_send_queue,
+		atomic_read(&info->send_credits) > 0);
+	if (rc)
+		return rc;
+	atomic_dec(&info->send_credits);
+
+	request = mempool_alloc(info->request_mempool, GFP_KERNEL);
+	if (!request) {
+		rc = -ENOMEM;
+		goto err;
+	}
+
+	request->info = info;
+
+	/* Fill in the packet header */
+	packet = smbd_request_payload(request);
+	packet->credits_requested = cpu_to_le16(info->send_credit_target);
+	packet->credits_granted =
+		cpu_to_le16(manage_credits_prior_sending(info));
+
+	packet->flags = 0;
+	if (manage_keep_alive_before_sending(info))
+		packet->flags |= cpu_to_le16(SMB_DIRECT_RESPONSE_REQUESTED);
+
+	packet->reserved = 0;
+	if (!size)
+		packet->data_offset = 0;
+	else
+		packet->data_offset = cpu_to_le32(24);
+	packet->data_length = cpu_to_le32(size);
+	packet->remaining_data_length = cpu_to_le32(remaining_data_length);
+	packet->padding = 0;
+
+	log_outgoing("credits_requested=%d credits_granted=%d data_offset=%d "
+		     "data_length=%d remaining_data_length=%d\n",
+		le16_to_cpu(packet->credits_requested),
+		le16_to_cpu(packet->credits_granted),
+		le32_to_cpu(packet->data_offset),
+		le32_to_cpu(packet->data_length),
+		le32_to_cpu(packet->remaining_data_length));
+
+	/* Map the packet to DMA */
+	header_length = sizeof(struct smbd_data_transfer);
+	/* If this is a packet without payload, don't send padding */
+	if (!size)
+		header_length = offsetof(struct smbd_data_transfer, padding);
+
+	request->num_sge = 1;
+	request->sge[0].addr = ib_dma_map_single(info->id->device,
+						 (void *)packet,
+						 header_length,
+						 DMA_TO_DEVICE);
+	if (ib_dma_mapping_error(info->id->device, request->sge[0].addr)) {
+		mempool_free(request, info->request_mempool);
+		rc = -EIO;
+		goto err;
+	}
+
+	request->sge[0].length = header_length;
+	request->sge[0].lkey = info->pd->local_dma_lkey;
+
+	*request_out = request;
+	return 0;
+
+err:
+	atomic_inc(&info->send_credits);
+	return rc;
+}
+
+static void smbd_destroy_header(struct smbd_connection *info,
+		struct smbd_request *request)
+{
+
+	ib_dma_unmap_single(info->id->device,
+			    request->sge[0].addr,
+			    request->sge[0].length,
+			    DMA_TO_DEVICE);
+	mempool_free(request, info->request_mempool);
+	atomic_inc(&info->send_credits);
+}
+
+/* Post the send request */
+static int smbd_post_send(struct smbd_connection *info,
+		struct smbd_request *request, bool has_payload)
+{
+	struct ib_send_wr send_wr, *send_wr_fail;
+	int rc, i;
+
+	for (i = 0; i < request->num_sge; i++) {
+		log_rdma_send("rdma_request sge[%d] addr=%llu length=%u\n",
+			i, request->sge[i].addr, request->sge[i].length);
+		ib_dma_sync_single_for_device(
+			info->id->device,
+			request->sge[i].addr,
+			request->sge[i].length,
+			DMA_TO_DEVICE);
+	}
+
+	request->cqe.done = send_done;
+
+	send_wr.next = NULL;
+	send_wr.wr_cqe = &request->cqe;
+	send_wr.sg_list = request->sge;
+	send_wr.num_sge = request->num_sge;
+	send_wr.opcode = IB_WR_SEND;
+	send_wr.send_flags = IB_SEND_SIGNALED;
+
+	if (has_payload) {
+		request->has_payload = true;
+		atomic_inc(&info->send_payload_pending);
+	} else {
+		request->has_payload = false;
+		atomic_inc(&info->send_pending);
+	}
+
+	rc = ib_post_send(info->id->qp, &send_wr, &send_wr_fail);
+	if (rc) {
+		log_rdma_send("ib_post_send failed rc=%d\n", rc);
+		if (has_payload) {
+			if (atomic_dec_and_test(&info->send_payload_pending))
+				wake_up(&info->wait_send_payload_pending);
+		} else {
+			if (atomic_dec_and_test(&info->send_pending))
+				wake_up(&info->wait_send_pending);
+		}
+	} else {
+		/* Reset timer for idle connection after packet is sent */
+		mod_delayed_work(info->workqueue, &info->idle_timer_work,
+			info->keep_alive_interval*HZ);
+	}
+
+	return rc;
+}
+
+/*
+ * Send a page
+ * page: the page to send
+ * offset: offset in the page to send
+ * size: length in the page to send
+ * remaining_data_length: remaining data to send in this payload
+ */
+static int smbd_post_send_page(struct smbd_connection *info, struct page *page,
+		unsigned long offset, size_t size, int remaining_data_length)
+{
+	struct smbd_request *request;
+	int rc;
+
+	rc = smbd_create_header(info, size, remaining_data_length, &request);
+	if (rc)
+		return rc;
+
+	/* Add payload to packet */
+	request->num_sge++;
+	request->sge[1].addr = ib_dma_map_page(info->id->device, page,
+					       offset, size, DMA_TO_DEVICE);
+	if (ib_dma_mapping_error(info->id->device, request->sge[1].addr)) {
+		smbd_destroy_header(info, request);
+		return -EIO;
+	}
+	request->sge[1].length = size;
+	request->sge[1].lkey = info->pd->local_dma_lkey;
+
+	rc = smbd_post_send(info, request, true);
+	if (rc) {
+		ib_dma_unmap_page(info->id->device,
+			    request->sge[1].addr,
+			    request->sge[1].length,
+			    DMA_TO_DEVICE);
+		smbd_destroy_header(info, request);
+	}
+	return rc;
+}
+
+/*
+ * Send an empty message
+ * An empty message is used to extend credits to the peer or to keep the
+ * connection alive while there is no upper layer payload to send
+ */
+static int smbd_post_send_empty(struct smbd_connection *info)
+{
+	struct smbd_request *request;
+	int rc;
+
+	rc = smbd_create_header(info, 0, 0, &request);
+	if (rc)
+		return rc;
+
+	info->count_send_empty++;
+	rc = smbd_post_send(info, request, false);
+	if (rc)
+		smbd_destroy_header(info, request);
+
+	return rc;
+}
+
+/*
+ * Send a data buffer
+ * iov: the iov array describing the data buffers
+ * n_vec: number of entries in the iov array
+ * remaining_data_length: remaining data to send following this packet
+ * in segmented SMBD packet
+ */
+static int smbd_post_send_data(
+	struct smbd_connection *info, struct kvec *iov, int n_vec,
+	int remaining_data_length)
+{
+	struct smbd_request *request;
+	int rc, i;
+	u32 data_length = 0;
+
+	for (i = 0; i < n_vec; i++)
+		data_length += iov[i].iov_len;
+	rc = smbd_create_header(
+		info, data_length, remaining_data_length, &request);
+	if (rc)
+		return rc;
+
+	for (i = 0; i < n_vec; i++) {
+		request->sge[i+1].addr =
+			ib_dma_map_single(info->id->device, iov[i].iov_base,
+				iov[i].iov_len, DMA_TO_DEVICE);
+		if (ib_dma_mapping_error(
+				info->id->device, request->sge[i+1].addr)) {
+			rc = -EIO;
+			request->sge[i+1].addr = 0;
+			goto dma_mapping_failure;
+		}
+		request->sge[i+1].length = iov[i].iov_len;
+		request->sge[i+1].lkey = info->pd->local_dma_lkey;
+		request->num_sge++;
+	}
+
+	rc = smbd_post_send(info, request, true);
+	if (!rc)
+		return 0;
+
+dma_mapping_failure:
+	for (i = 1; i < request->num_sge; i++)
+		if (request->sge[i].addr)
+			ib_dma_unmap_single(info->id->device,
+					    request->sge[i].addr,
+					    request->sge[i].length,
+					    DMA_TO_DEVICE);
+	smbd_destroy_header(info, request);
+	return rc;
+}
+
+/*
+ * Post a receive request to the transport
+ * The remote peer can only send data when a receive request is posted
+ * The interaction is controlled by send/receive credit system
+ */
+static int smbd_post_recv(
+		struct smbd_connection *info, struct smbd_response *response)
+{
+	struct ib_recv_wr recv_wr, *recv_wr_fail = NULL;
+	int rc = -EIO;
+
+	response->sge.addr = ib_dma_map_single(
+				info->id->device, response->packet,
+				info->max_receive_size, DMA_FROM_DEVICE);
+	if (ib_dma_mapping_error(info->id->device, response->sge.addr))
+		return rc;
+
+	response->sge.length = info->max_receive_size;
+	response->sge.lkey = info->pd->local_dma_lkey;
+
+	response->cqe.done = recv_done;
+
+	recv_wr.wr_cqe = &response->cqe;
+	recv_wr.next = NULL;
+	recv_wr.sg_list = &response->sge;
+	recv_wr.num_sge = 1;
+
+	atomic_inc(&info->recv_pending);
+	rc = ib_post_recv(info->id->qp, &recv_wr, &recv_wr_fail);
+	if (rc) {
+		ib_dma_unmap_single(info->id->device, response->sge.addr,
+				    response->sge.length, DMA_FROM_DEVICE);
+
+		log_rdma_recv("ib_post_recv failed rc=%d\n", rc);
+		atomic_dec(&info->recv_pending);
+	}
+
+	return rc;
+}
+
+/* Perform SMBD negotiate according to [MS-SMBD] 3.1.5.2 */
+static int smbd_negotiate(struct smbd_connection *info)
+{
+	int rc;
+	struct smbd_response *response = get_receive_buffer(info);
+
+	if (!response)
+		return -ENOMEM;
+
+	response->type = SMBD_NEGOTIATE_RESP;
+	rc = smbd_post_recv(info, response);
+	log_rdma_event("smbd_post_recv rc=%d iov.addr=%llx iov.length=%x "
+		       "iov.lkey=%x\n",
+		rc, response->sge.addr,
+		response->sge.length, response->sge.lkey);
+	if (rc)
+		return rc;
+
+	init_completion(&info->negotiate_completion);
+	info->negotiate_done = false;
+	rc = smbd_post_send_negotiate_req(info);
+	if (rc)
+		return rc;
+
+	rc = wait_for_completion_interruptible_timeout(
+		&info->negotiate_completion, SMBD_NEGOTIATE_TIMEOUT * HZ);
+	log_rdma_event("wait_for_completion_timeout rc=%d\n", rc);
+
+	if (info->negotiate_done)
+		return 0;
+
+	if (rc == 0)
+		rc = -ETIMEDOUT;
+	else if (rc == -ERESTARTSYS)
+		rc = -EINTR;
+	else
+		rc = -ENOTCONN;
+
+	return rc;
+}
+
+/*
+ * Implement Connection.FragmentReassemblyBuffer defined in [MS-SMBD] 3.1.1.1
+ * This is a queue for reassembling upper layer payload and presenting it to
+ * the upper layer. All incoming payloads go to the reassembly queue,
+ * regardless of whether reassembly is required. The upper layer code reads
+ * from the queue for all incoming payloads.
+ * Put a received packet to the reassembly queue
+ * response: the packet received
+ * data_length: the size of payload in this packet
+ */
+static void enqueue_reassembly(
+	struct smbd_connection *info,
+	struct smbd_response *response,
+	int data_length)
+{
+	spin_lock(&info->reassembly_queue_lock);
+	list_add_tail(&response->list, &info->reassembly_queue);
+	info->reassembly_data_length += data_length;
+	log_reassembly_queue("info->reassembly_data_length=%d\n",
+			info->reassembly_data_length);
+	info->count_reassembly_queue++;
+	info->count_enqueue_reassembly_queue++;
+	spin_unlock(&info->reassembly_queue_lock);
+}
+
+/*
+ * Get the first entry at the front of reassembly queue
+ * Caller is responsible for locking
+ * return value: the first entry if any, NULL if queue is empty
+ */
+static struct smbd_response *_get_first_reassembly(struct smbd_connection *info)
+{
+	struct smbd_response *ret = NULL;
+
+	if (!list_empty(&info->reassembly_queue)) {
+		ret = list_first_entry(
+			&info->reassembly_queue,
+			struct smbd_response, list);
+	}
+	return ret;
+}
+
+/*
+ * Get a receive buffer
+ * For each remote send, we need to post a receive. The receive buffers are
+ * allocated in advance.
+ * return value: the receive buffer, NULL if none is available
+ */
+static struct smbd_response *get_receive_buffer(struct smbd_connection *info)
+{
+	struct smbd_response *ret = NULL;
+
+	spin_lock(&info->receive_queue_lock);
+	if (!list_empty(&info->receive_queue)) {
+		ret = list_first_entry(
+			&info->receive_queue,
+			struct smbd_response, list);
+		list_del(&ret->list);
+		info->count_receive_buffer--;
+		info->count_get_receive_buffer++;
+	}
+	spin_unlock(&info->receive_queue_lock);
+
+	return ret;
+}
+
+/*
+ * Return a receive buffer
+ * When a receive buffer is returned, we can post a new receive and extend
+ * more receive credits to remote peer. This is done immediately after a
+ * receive buffer is returned.
+ */
+static void put_receive_buffer(
+	struct smbd_connection *info, struct smbd_response *response)
+{
+	ib_dma_unmap_single(info->id->device, response->sge.addr,
+		response->sge.length, DMA_FROM_DEVICE);
+
+	spin_lock(&info->receive_queue_lock);
+	list_add_tail(&response->list, &info->receive_queue);
+	info->count_receive_buffer++;
+	info->count_put_receive_buffer++;
+	spin_unlock(&info->receive_queue_lock);
+
+	/* Check if we can post new receive and grant credits to peer */
+	check_and_send_immediate(info);
+}
+
+/* Preallocate all receive buffers on transport establishment */
+static int allocate_receive_buffers(struct smbd_connection *info, int num_buf)
+{
+	int i;
+	struct smbd_response *response;
+
+	INIT_LIST_HEAD(&info->reassembly_queue);
+	spin_lock_init(&info->reassembly_queue_lock);
+	info->reassembly_data_length = 0;
+
+	INIT_LIST_HEAD(&info->receive_queue);
+	spin_lock_init(&info->receive_queue_lock);
+
+	for (i = 0; i < num_buf; i++) {
+		response = mempool_alloc(info->response_mempool, GFP_KERNEL);
+		if (!response)
+			goto allocate_failed;
+
+		response->info = info;
+		list_add_tail(&response->list, &info->receive_queue);
+		info->count_receive_buffer++;
+	}
+
+	return 0;
+
+allocate_failed:
+	while (!list_empty(&info->receive_queue)) {
+		response = list_first_entry(
+				&info->receive_queue,
+				struct smbd_response, list);
+		list_del(&response->list);
+		info->count_receive_buffer--;
+
+		mempool_free(response, info->response_mempool);
+	}
+	return -ENOMEM;
+}
+
+static void destroy_receive_buffers(struct smbd_connection *info)
+{
+	struct smbd_response *response;
+
+	while ((response = get_receive_buffer(info)))
+		mempool_free(response, info->response_mempool);
+}
+
+/*
+ * Check and send an immediate or keep alive packet
+ * The conditions for sending those packets are defined in [MS-SMBD] 3.1.1.1
+ * Connection.KeepaliveRequested and Connection.SendImmediate
+ * The idea is to extend credits to server as soon as they become available
+ */
+static void send_immediate_work(struct work_struct *work)
+{
+	struct smbd_connection *info = container_of(
+					work, struct smbd_connection,
+					send_immediate_work.work);
+
+	if (info->keep_alive_requested == KEEP_ALIVE_PENDING ||
+	    info->send_immediate) {
+		log_keep_alive("send an empty message\n");
+		smbd_post_send_empty(info);
+	}
+}
+
+/* Implement idle connection timer [MS-SMBD] 3.1.6.2 */
+static void idle_connection_timer(struct work_struct *work)
+{
+	struct smbd_connection *info = container_of(
+					work, struct smbd_connection,
+					idle_timer_work.work);
+
+	if (info->keep_alive_requested != KEEP_ALIVE_NONE) {
+		log_keep_alive("error status info->keep_alive_requested=%d\n",
+				info->keep_alive_requested);
+		smbd_disconnect_rdma_connection(info);
+		return;
+	}
+
+	log_keep_alive("about to send an empty idle message\n");
+	smbd_post_send_empty(info);
+
+	/* Setup the next idle timeout work */
+	queue_delayed_work(info->workqueue, &info->idle_timer_work,
+				info->keep_alive_interval*HZ);
+}
+
+/* Destroy this SMBD connection, called from upper layer */
+void smbd_destroy(struct smbd_connection *info)
+{
+	log_rdma_event("destroying rdma session\n");
+
+	/* Kick off the disconnection process */
+	if (info->transport_status == SMBD_CONNECTED)
+		rdma_disconnect(info->id);
+
+	info->server_info->tcpStatus = CifsExiting;
+
+	log_rdma_event("wait for transport being destroyed\n");
+	wait_event(info->wait_destroy,
+		info->transport_status == SMBD_DESTROYED);
+}
+
+/*
+ * Reconnect this SMBD connection, called from upper layer
+ * return value: 0 on success, or actual error code
+ */
+int smbd_reconnect(struct TCP_Server_Info *server)
+{
+	log_rdma_event("reconnecting rdma session\n");
+
+	/* why reconnect while it is still connected? */
+	if (server->smbd_conn->transport_status == SMBD_CONNECTED) {
+		log_rdma_event("still connected, not reconnecting\n");
+		return -EINVAL;
+	}
+
+	/* wait until the transport is destroyed */
+	wait_event(server->smbd_conn->wait_destroy,
+		server->smbd_conn->transport_status == SMBD_DESTROYED);
+
+	kfree(server->smbd_conn);
+
+	log_rdma_event("creating rdma session\n");
+	server->smbd_conn = smbd_get_connection(
+		server, (struct sockaddr *) &server->dstaddr);
+
+	return server->smbd_conn ? 0 : -ENOENT;
+}
+
+#define MAX_NAME_LEN	80
+static int allocate_caches_and_workqueue(struct smbd_connection *info)
+{
+	char name[MAX_NAME_LEN];
+
+	snprintf(name, MAX_NAME_LEN, "smbd_request_%p", info);
+	info->request_cache =
+		kmem_cache_create(
+			name,
+			sizeof(struct smbd_request) +
+				sizeof(struct smbd_data_transfer),
+			0, SLAB_HWCACHE_ALIGN, NULL);
+	if (!info->request_cache)
+		return -ENOMEM;
+
+	info->request_mempool =
+		mempool_create(info->send_credit_target, mempool_alloc_slab,
+			mempool_free_slab, info->request_cache);
+	if (!info->request_mempool)
+		goto out1;
+
+	snprintf(name, MAX_NAME_LEN, "smbd_response_%p", info);
+	info->response_cache =
+		kmem_cache_create(
+			name,
+			sizeof(struct smbd_response) +
+				info->max_receive_size,
+			0, SLAB_HWCACHE_ALIGN, NULL);
+	if (!info->response_cache)
+		goto out2;
+
+	info->response_mempool =
+		mempool_create(info->receive_credit_max, mempool_alloc_slab,
+		       mempool_free_slab, info->response_cache);
+	if (!info->response_mempool)
+		goto out3;
+
+	snprintf(name, MAX_NAME_LEN, "smbd_%p", info);
+	info->workqueue = create_workqueue(name);
+	if (!info->workqueue)
+		goto out4;
+
+	return 0;
+
+out4:
+	mempool_destroy(info->response_mempool);
+out3:
+	kmem_cache_destroy(info->response_cache);
+out2:
+	mempool_destroy(info->request_mempool);
+out1:
+	kmem_cache_destroy(info->request_cache);
+	return -ENOMEM;
+}
+
+/* Create a SMBD connection, called by upper layer */
+struct smbd_connection *smbd_get_connection(
+	struct TCP_Server_Info *server, struct sockaddr *dstaddr)
+{
+	int rc;
+	struct smbd_connection *info;
+	struct rdma_conn_param conn_param;
+	struct ib_qp_init_attr qp_attr;
+	struct sockaddr_in *addr_in = (struct sockaddr_in *) dstaddr;
+	int port;
+
+	info = kzalloc(sizeof(struct smbd_connection), GFP_KERNEL);
+	if (!info)
+		return NULL;
+
+	info->server_info = server;
+	info->transport_status = SMBD_CONNECTING;
+
+	port = SMB_PORT;
+try_another_port:
+	rc = smbd_ia_open(info, dstaddr, port);
+	if (rc) {
+		log_rdma_event("smbd_ia_open rc=%d\n", rc);
+		goto out1;
+	}
+
+	if (send_credit_target > info->id->device->attrs.max_cqe ||
+	    send_credit_target > info->id->device->attrs.max_qp_wr) {
+		log_rdma_event("consider lowering send_credit_target = %d. "
+			"Possible CQE overrun, device "
+			"reporting max_cqe %d max_qp_wr %d\n",
+			send_credit_target,
+			info->id->device->attrs.max_cqe,
+			info->id->device->attrs.max_qp_wr);
+		goto out2;
+	}
+
+	if (receive_credit_max > info->id->device->attrs.max_cqe ||
+	    receive_credit_max > info->id->device->attrs.max_qp_wr) {
+		log_rdma_event("consider lowering receive_credit_max = %d. "
+			"Possible CQE overrun, device "
+			"reporting max_cqe %d max_qp_wr %d\n",
+			receive_credit_max,
+			info->id->device->attrs.max_cqe,
+			info->id->device->attrs.max_qp_wr);
+		goto out2;
+	}
+
+	info->receive_credit_max = receive_credit_max;
+	info->send_credit_target = send_credit_target;
+	info->max_send_size = max_send_size;
+	info->max_fragmented_recv_size = max_fragmented_recv_size;
+	info->max_receive_size = max_receive_size;
+	info->keep_alive_interval = keep_alive_interval;
+
+	max_send_sge = min_t(int, max_send_sge,
+		info->id->device->attrs.max_sge);
+	max_recv_sge = min_t(int, max_recv_sge,
+		info->id->device->attrs.max_sge_rd);
+
+	info->send_cq = ib_alloc_cq(info->id->device, info,
+			info->send_credit_target, 0, IB_POLL_WORKQUEUE);
+	if (IS_ERR(info->send_cq))
+		goto out2;
+
+	info->recv_cq = ib_alloc_cq(info->id->device, info,
+			info->receive_credit_max, 0, IB_POLL_WORKQUEUE);
+	if (IS_ERR(info->recv_cq))
+		goto out2;
+
+	memset(&qp_attr, 0, sizeof(qp_attr));
+	qp_attr.event_handler = smbd_qp_async_error_upcall;
+	qp_attr.qp_context = info;
+	qp_attr.cap.max_send_wr = info->send_credit_target;
+	qp_attr.cap.max_recv_wr = info->receive_credit_max;
+	qp_attr.cap.max_send_sge = max_send_sge;
+	qp_attr.cap.max_recv_sge = max_recv_sge;
+	qp_attr.cap.max_inline_data = 0;
+	qp_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
+	qp_attr.qp_type = IB_QPT_RC;
+	qp_attr.send_cq = info->send_cq;
+	qp_attr.recv_cq = info->recv_cq;
+	qp_attr.port_num = ~0;
+
+	rc = rdma_create_qp(info->id, info->pd, &qp_attr);
+	if (rc) {
+		log_rdma_event("rdma_create_qp failed %i\n", rc);
+		rc = -ENETUNREACH;
+		goto out2;
+	}
+
+	memset(&conn_param, 0, sizeof(conn_param));
+	conn_param.private_data = NULL;
+	conn_param.private_data_len = 0;
+	conn_param.initiator_depth = 0;
+
+	conn_param.responder_resources =
+		info->id->device->attrs.max_qp_rd_atom
+			< SMBD_CM_RESPONDER_RESOURCES ?
+		info->id->device->attrs.max_qp_rd_atom :
+		SMBD_CM_RESPONDER_RESOURCES;
+	info->responder_resources = conn_param.responder_resources;
+	log_rdma_mr("responder_resources=%d\n", info->responder_resources);
+
+	conn_param.retry_count = SMBD_CM_RETRY;
+	conn_param.rnr_retry_count = SMBD_CM_RNR_RETRY;
+	conn_param.flow_control = 0;
+	init_waitqueue_head(&info->conn_wait);
+	init_waitqueue_head(&info->wait_destroy);
+
+	log_rdma_event("connecting to IP %pI4 port %d\n",
+		&addr_in->sin_addr, port);
+
+	rc = rdma_connect(info->id, &conn_param);
+	if (rc) {
+		log_rdma_event("rdma_connect() failed with %i\n", rc);
+		goto out2;
+	}
+
+	wait_event_interruptible(
+		info->conn_wait, info->transport_status == SMBD_CONNECTED);
+	if (info->connect_state != RDMA_CM_EVENT_ESTABLISHED)
+		goto out2;
+
+	log_rdma_event("rdma_connect connected\n");
+
+	rc = allocate_caches_and_workqueue(info);
+	if (rc) {
+		log_rdma_event("cache allocation failed\n");
+		goto out2;
+	}
+
+	rc = allocate_receive_buffers(info, info->receive_credit_max);
+	if (rc) {
+		log_rdma_event("failed to allocate receive buffers\n");
+		goto out2;
+	}
+
+	init_waitqueue_head(&info->wait_send_queue);
+	init_waitqueue_head(&info->wait_reassembly_queue);
+
+	INIT_DELAYED_WORK(&info->idle_timer_work, idle_connection_timer);
+	INIT_DELAYED_WORK(&info->send_immediate_work, send_immediate_work);
+	queue_delayed_work(info->workqueue, &info->idle_timer_work,
+		info->keep_alive_interval*HZ);
+
+	init_waitqueue_head(&info->wait_send_pending);
+	atomic_set(&info->send_pending, 0);
+
+	init_waitqueue_head(&info->wait_send_payload_pending);
+	atomic_set(&info->send_payload_pending, 0);
+
+	init_waitqueue_head(&info->wait_recv_pending);
+	atomic_set(&info->recv_pending, 0);
+
+	init_waitqueue_head(&info->wait_read_pending);
+	atomic_set(&info->read_pending, 0);
+
+	INIT_WORK(&info->disconnect_work, smbd_disconnect_rdma_work);
+	INIT_WORK(&info->destroy_work, smbd_destroy_rdma_work);
+
+	rc = smbd_negotiate(info);
+	if (rc) {
+		log_rdma_event("smbd_negotiate rc=%d\n", rc);
+		goto negotiation_failed;
+	}
+
+	rc = allocate_mr_list(info);
+	if (rc) {
+		log_rdma_mr("memory registration allocation failed\n");
+		goto negotiation_failed;
+	}
+
+	return info;
+
+negotiation_failed:
+	smbd_destroy(info);
+
+out2:
+	rdma_destroy_id(info->id);
+	/* try port SMBD_PORT if SMB_PORT doesn't work */
+	if (port == SMB_PORT) {
+		port = SMBD_PORT;
+		goto try_another_port;
+	}
+
+out1:
+	kfree(info);
+	return NULL;
+}
+
+/*
+ * Receive a page from receive reassembly queue
+ * page: the page to read data into
+ * to_read: the length of data to read
+ * return value: actual data read
+ */
+int smbd_recv_page(struct smbd_connection *info,
+		struct page *page, unsigned int to_read)
+{
+	int ret;
+	char *to_address;
+
+	/* make sure we have the page ready for read */
+	ret = wait_event_interruptible(
+		info->wait_reassembly_queue,
+		info->reassembly_data_length >= to_read ||
+			info->transport_status != SMBD_CONNECTED);
+	if (ret)
+		return 0;
+
+	/* now we can read from reassembly queue and not sleep */
+	to_address = kmap_atomic(page);
+
+	log_read("reading from page=%p address=%p to_read=%d\n",
+		page, to_address, to_read);
+
+	ret = smbd_recv(info, to_address, to_read);
+	kunmap_atomic(to_address);
+
+	return ret;
+}
+
+/*
+ * Receive data from receive reassembly queue
+ * All the incoming data packets are placed in reassembly queue
+ * buf: the buffer to read data into
+ * size: the length of data to read
+ * return value: actual data read
+ * Note: this implementation copies the data from reassembly queue to receive
+ * buffers used by upper layer. This is not the optimal code path. A better way
+ * to do it is to not have upper layer allocate its receive buffers but rather
+ * borrow the buffer from reassembly queue, and return it after data is
+ * consumed. But this will require more changes to upper layer code, and also
+ * need to consider packet boundaries while they are still being reassembled.
+ */
+int smbd_recv(struct smbd_connection *info, char *buf, unsigned int size)
+{
+	struct smbd_response *response;
+	struct smbd_data_transfer *data_transfer;
+	int to_copy, to_read, data_read, offset;
+	u32 data_length, remaining_data_length, data_offset;
+	int rc;
+
+again:
+	/* the transport is disconnected? */
+	if (info->transport_status != SMBD_CONNECTED) {
+		log_read("disconnected\n");
+
+		/*
+		 * FIXME If upper layer code is reading SMB packet length
+		 * return 0 to indicate transport is disconnected and
+		 * trigger a reconnect.
+		 */
+		return 0;
+	}
+
+	/*
+	 * No need to hold the reassembly queue lock all the time as we are
+	 * the only one reading from the front of the queue. The transport
+	 * may add more entries to the back of the queue at the same time
+	 */
+	log_read("size=%d info->reassembly_data_length=%d\n", size,
+		info->reassembly_data_length);
+	if (info->reassembly_data_length >= size) {
+		atomic_inc(&info->read_pending);
+		data_read = 0;
+		to_read = size;
+		offset = info->first_entry_offset;
+		while (data_read < size) {
+			spin_lock(&info->reassembly_queue_lock);
+			response = _get_first_reassembly(info);
+			spin_unlock(&info->reassembly_queue_lock);
+			data_transfer = smbd_response_payload(response);
+
+			data_length = le32_to_cpu(data_transfer->data_length);
+			remaining_data_length =
+				le32_to_cpu(
+					data_transfer->remaining_data_length);
+			data_offset = le32_to_cpu(data_transfer->data_offset);
+
+			/*
+			 * The upper layer expects RFC1002 length at the
+			 * beginning of the payload. Return it to indicate
+			 * the total length of the packet. This minimizes the
+			 * change to upper layer packet processing logic. This
+			 * will eventually be removed when an intermediate
+			 * transport layer is added
+			 */
+			if (response->first_segment && size == 4) {
+				unsigned int rfc1002_len =
+					data_length + remaining_data_length;
+				*((__be32 *)buf) = cpu_to_be32(rfc1002_len);
+				data_read = 4;
+				response->first_segment = false;
+				log_read("returning rfc1002 length %d\n",
+					rfc1002_len);
+				goto read_rfc1002_done;
+			}
+
+			to_copy = min_t(int, data_length - offset, to_read);
+			memcpy(
+				buf + data_read,
+				(char *)data_transfer + data_offset + offset,
+				to_copy);
+
+			/* move on to the next buffer? */
+			if (to_copy == data_length - offset) {
+				spin_lock(&info->reassembly_queue_lock);
+				list_del(&response->list);
+				spin_unlock(&info->reassembly_queue_lock);
+
+				info->count_reassembly_queue--;
+				info->count_dequeue_reassembly_queue++;
+				put_receive_buffer(info, response);
+				offset = 0;
+				log_read("put_receive_buffer offset=0\n");
+			} else
+				offset += to_copy;
+
+			to_read -= to_copy;
+			data_read += to_copy;
+
+			log_read("_get_first_reassembly memcpy %d bytes "
+				"data_transfer_length-offset=%d after that "
+				"to_read=%d data_read=%d offset=%d\n",
+				to_copy, data_length - offset,
+				to_read, data_read, offset);
+		}
+
+		spin_lock(&info->reassembly_queue_lock);
+		info->reassembly_data_length -= data_read;
+		spin_unlock(&info->reassembly_queue_lock);
+
+		info->first_entry_offset = offset;
+		log_read("returning to thread data_read=%d "
+			"reassembly_data_length=%d first_entry_offset=%d\n",
+			data_read, info->reassembly_data_length,
+			info->first_entry_offset);
+read_rfc1002_done:
+		if (atomic_dec_and_test(&info->read_pending))
+			wake_up(&info->wait_read_pending);
+		return data_read;
+	}
+
+	log_read("wait_event on more data\n");
+	rc = wait_event_interruptible(
+		info->wait_reassembly_queue,
+		info->reassembly_data_length >= size ||
+			info->transport_status != SMBD_CONNECTED);
+	/* Don't return any data if interrupted */
+	if (rc)
+		return 0;
+
+	goto again;
+}
+
+/*
+ * Send data to transport
+ * Each rqst is transported as a SMBDirect payload
+ * rqst: the data to write
+ * return value: 0 on success, otherwise error code
+ */
+int smbd_send(struct smbd_connection *info, struct smb_rqst *rqst)
+{
+	struct kvec vec;
+	int nvecs;
+	int size;
+	int buflen = 0, remaining_data_length;
+	int start, i, j;
+	int max_iov_size =
+		info->max_send_size - sizeof(struct smbd_data_transfer);
+	struct kvec iov[SMBDIRECT_MAX_SGE];
+	int rc;
+
+	if (info->transport_status != SMBD_CONNECTED) {
+		log_write("disconnected returning -EIO\n");
+		return -EIO;
+	}
+
+	/*
+	 * This usually means a configuration error
+	 * We use RDMA read/write for packet size > rdma_readwrite_threshold
+	 * as long as it's properly configured we should never get into this
+	 * situation
+	 */
+	if (rqst->rq_nvec + rqst->rq_npages > SMBDIRECT_MAX_SGE) {
+		log_write("maximum send segment %x exceeding %x\n",
+			 rqst->rq_nvec + rqst->rq_npages, SMBDIRECT_MAX_SGE);
+		return -EINVAL;
+	}
+
+	/*
+	 * Remove the RFC1002 length defined in MS-SMB2 section 2.1
+	 * It is used only for TCP transport
+	 * In future we may want to add a transport layer under protocol
+	 * layer so this will only be issued to TCP transport
+	 */
+	iov[0].iov_base = (char *)rqst->rq_iov[0].iov_base + 4;
+	iov[0].iov_len = rqst->rq_iov[0].iov_len - 4;
+	buflen += iov[0].iov_len;
+
+	/* total up iov array first */
+	for (i = 1; i < rqst->rq_nvec; i++) {
+		iov[i].iov_base = rqst->rq_iov[i].iov_base;
+		iov[i].iov_len = rqst->rq_iov[i].iov_len;
+		buflen += iov[i].iov_len;
+	}
+
+	/* add in the page array if there is one */
+	if (rqst->rq_npages) {
+		buflen += rqst->rq_pagesz * (rqst->rq_npages - 1);
+		buflen += rqst->rq_tailsz;
+	}
+
+	if (buflen + sizeof(struct smbd_data_transfer) >
+		info->max_fragmented_send_size) {
+		log_write("payload size %d > max size %d\n",
+			buflen, info->max_fragmented_send_size);
+		rc = -EINVAL;
+		goto done;
+	}
+
+	remaining_data_length = buflen;
+
+	log_write("rqst->rq_nvec=%d rqst->rq_npages=%d rq_pagesz=%d "
+		"rq_tailsz=%d buflen=%d\n",
+		rqst->rq_nvec, rqst->rq_npages, rqst->rq_pagesz,
+		rqst->rq_tailsz, buflen);
+
+	start = i = iov[0].iov_len ? 0 : 1;
+	buflen = 0;
+	while (true) {
+		buflen += iov[i].iov_len;
+		if (buflen > max_iov_size) {
+			if (i > start) {
+				remaining_data_length -=
+					(buflen-iov[i].iov_len);
+				log_write("sending iov[] from start=%d "
+					"i=%d nvecs=%d "
+					"remaining_data_length=%d\n",
+					start, i, i-start,
+					remaining_data_length);
+				rc = smbd_post_send_data(
+					info, &iov[start], i-start,
+					remaining_data_length);
+				if (rc)
+					goto done;
+			} else {
+				/* iov[start] is too big, break it */
+				nvecs = (buflen+max_iov_size-1)/max_iov_size;
+				log_write("iov[%d] iov_base=%p buflen=%d"
+					" break to %d vectors\n",
+					start, iov[start].iov_base,
+					buflen, nvecs);
+				for (j = 0; j < nvecs; j++) {
+					vec.iov_base =
+						(char *)iov[start].iov_base +
+						j*max_iov_size;
+					vec.iov_len = max_iov_size;
+					if (j == nvecs-1)
+						vec.iov_len =
+							buflen -
+							max_iov_size*(nvecs-1);
+					remaining_data_length -= vec.iov_len;
+					log_write(
+						"sending vec j=%d iov_base=%p"
+						" iov_len=%lu "
+						"remaining_data_length=%d\n",
+						j, vec.iov_base, vec.iov_len,
+						remaining_data_length);
+					rc = smbd_post_send_data(
+						info, &vec, 1,
+						remaining_data_length);
+					if (rc)
+						goto done;
+				}
+				i++;
+			}
+			start = i;
+			buflen = 0;
+		} else {
+			i++;
+			if (i == rqst->rq_nvec) {
+				/* send out all remaining vecs */
+				remaining_data_length -= buflen;
+				log_write(
+					"sending iov[] from start=%d i=%d "
+					"nvecs=%d remaining_data_length=%d\n",
+					start, i, i-start,
+					remaining_data_length);
+				rc = smbd_post_send_data(info, &iov[start],
+					i-start, remaining_data_length);
+				if (rc)
+					goto done;
+				break;
+			}
+		}
+		log_write("looping i=%d buflen=%d\n", i, buflen);
+	}
+
+	/* now sending pages if there are any */
+	for (i = 0; i < rqst->rq_npages; i++) {
+		buflen = (i == rqst->rq_npages-1) ?
+			rqst->rq_tailsz : rqst->rq_pagesz;
+		nvecs = (buflen + max_iov_size - 1) / max_iov_size;
+		log_write("sending pages buflen=%d nvecs=%d\n",
+			buflen, nvecs);
+		for (j = 0; j < nvecs; j++) {
+			size = max_iov_size;
+			if (j == nvecs-1)
+				size = buflen - j*max_iov_size;
+			remaining_data_length -= size;
+			log_write("sending pages i=%d offset=%d size=%d"
+				" remaining_data_length=%d\n",
+				i, j*max_iov_size, size, remaining_data_length);
+			rc = smbd_post_send_page(
+				info, rqst->rq_pages[i], j*max_iov_size,
+				size, remaining_data_length);
+			if (rc)
+				goto done;
+		}
+	}
+
+done:
+	/*
+	 * As an optimization, we don't wait for individual I/O to finish
+	 * before sending the next one.
+	 * Send them all and wait for the pending send count to reach 0,
+	 * which means all the I/Os have been sent and we are good to return
+	 */
+	wait_event(info->wait_send_payload_pending,
+		atomic_read(&info->send_payload_pending) == 0);
+	return rc;
+}
+
+static void register_mr_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	if (wc->status)
+		log_rdma_mr("status=%d\n", wc->status);
+}
+
+/*
+ * The work queue function that recovers MRs
+ * We need to call ib_dereg_mr() and ib_alloc_mr() before this MR can be used
+ * again. Both calls are slow, so finish them in a workqueue. This will not
+ * block I/O path.
+ * There is one workqueue that recovers MRs, there is no need to lock as the
+ * I/O requests calling smbd_register_mr will never update the links in the
+ * mr_list.
+ */
+static void smbd_mr_recovery_work(struct work_struct *work)
+{
+	struct smbd_connection *info =
+		container_of(work, struct smbd_connection, mr_recovery_work);
+	struct smbd_mr *smbdirect_mr;
+	int rc;
+
+	list_for_each_entry(smbdirect_mr, &info->mr_list, list) {
+		if (smbdirect_mr->state == MR_INVALIDATED ||
+			smbdirect_mr->state == MR_ERROR) {
+
+			/* recover this MR entry */
+			if (smbdirect_mr->state == MR_INVALIDATED)
+				ib_dma_unmap_sg(
+					info->id->device,
+					smbdirect_mr->sgl,
+					smbdirect_mr->sgl_count,
+					smbdirect_mr->dir);
+
+			rc = ib_dereg_mr(smbdirect_mr->mr);
+			if (rc) {
+				log_rdma_mr("ib_dereg_mr failed rc=%x\n", rc);
+				rdma_disconnect(info->id);
+			}
+
+			smbdirect_mr->mr = ib_alloc_mr(
+				info->pd, info->mr_type, info->max_frmr_depth);
+			if (IS_ERR(smbdirect_mr->mr)) {
+				log_rdma_mr(
+					"ib_alloc_mr failed mr_type=%x "
+					"max_frmr_depth=%x\n",
+					info->mr_type, info->max_frmr_depth);
+				rdma_disconnect(info->id);
+			}
+
+			smbdirect_mr->state = MR_READY;
+			/* smbdirect_mr->state is updated by this function
+			 * and is read and updated by I/O issuing CPUs trying
+			 * to get a MR, the call to atomic_inc_return
+			 * implies a memory barrier and guarantees this
+			 * value is updated before waking up any calls to
+			 * get_mr() from the I/O issuing CPUs
+			 */
+			if (atomic_inc_return(&info->mr_ready_count) == 1)
+				wake_up_interruptible(&info->wait_mr);
+		}
+	}
+}
+
+static void destroy_mr_list(struct smbd_connection *info)
+{
+	struct smbd_mr *mr, *tmp;
+
+	cancel_work_sync(&info->mr_recovery_work);
+	list_for_each_entry_safe(mr, tmp, &info->mr_list, list) {
+		if (mr->state == MR_INVALIDATED)
+			ib_dma_unmap_sg(info->id->device, mr->sgl,
+				mr->sgl_count, mr->dir);
+		ib_dereg_mr(mr->mr);
+		kfree(mr->sgl);
+		kfree(mr);
+	}
+}
+
+/*
+ * Allocate MRs used for RDMA read/write
+ * The number of MRs will not exceed hardware capability in responder_resources
+ * All MRs are kept in mr_list. The MR can be recovered after it's used
+ * Recovery is done in smbd_mr_recovery_work. The content of list entry changes
+ * as MRs are used and recovered for I/O, but the list links will not change
+ */
+static int allocate_mr_list(struct smbd_connection *info)
+{
+	int i;
+	struct smbd_mr *smbdirect_mr, *tmp;
+
+	INIT_LIST_HEAD(&info->mr_list);
+	init_waitqueue_head(&info->wait_mr);
+	spin_lock_init(&info->mr_list_lock);
+	atomic_set(&info->mr_ready_count, 0);
+	for (i = 0; i < info->responder_resources; i++) {
+		smbdirect_mr = kzalloc(sizeof(*smbdirect_mr), GFP_KERNEL);
+		if (!smbdirect_mr)
+			goto out;
+		smbdirect_mr->mr = ib_alloc_mr(info->pd, info->mr_type,
+					info->max_frmr_depth);
+		if (IS_ERR(smbdirect_mr->mr)) {
+			log_rdma_mr("ib_alloc_mr failed mr_type=%x "
+				"max_frmr_depth=%x\n",
+				info->mr_type, info->max_frmr_depth);
+			goto out;
+		}
+		smbdirect_mr->sgl = kcalloc(
+					info->max_frmr_depth,
+					sizeof(struct scatterlist),
+					GFP_KERNEL);
+		if (!smbdirect_mr->sgl) {
+			log_rdma_mr("failed to allocate sgl\n");
+			ib_dereg_mr(smbdirect_mr->mr);
+			goto out;
+		}
+		smbdirect_mr->state = MR_READY;
+		smbdirect_mr->conn = info;
+
+		list_add_tail(&smbdirect_mr->list, &info->mr_list);
+		atomic_inc(&info->mr_ready_count);
+	}
+	INIT_WORK(&info->mr_recovery_work, smbd_mr_recovery_work);
+	return 0;
+
+out:
+	kfree(smbdirect_mr);
+
+	list_for_each_entry_safe(smbdirect_mr, tmp, &info->mr_list, list) {
+		ib_dereg_mr(smbdirect_mr->mr);
+		kfree(smbdirect_mr->sgl);
+		kfree(smbdirect_mr);
+	}
+	return -ENOMEM;
+}
+
+/*
+ * Get a MR from mr_list. This function waits until there is at least one
+ * MR available in the list. It may access the list while the
+ * smbd_mr_recovery_work is recovering the MR list. This doesn't need a lock
+ * as they never modify the same places. However, there may be several CPUs
+ * issuing I/O trying to get MR at the same time, mr_list_lock is used to
+ * protect against this situation.
+ */
+static struct smbd_mr *get_mr(struct smbd_connection *info)
+{
+	struct smbd_mr *ret;
+	int rc;
+again:
+	rc = wait_event_interruptible(info->wait_mr,
+		atomic_read(&info->mr_ready_count) ||
+		info->transport_status != SMBD_CONNECTED);
+	if (rc) {
+		log_rdma_mr("wait_event_interruptible rc=%x\n", rc);
+		return NULL;
+	}
+
+	if (info->transport_status != SMBD_CONNECTED) {
+		log_rdma_mr("info->transport_status=%x\n",
+			info->transport_status);
+		return NULL;
+	}
+
+	spin_lock(&info->mr_list_lock);
+	list_for_each_entry(ret, &info->mr_list, list) {
+		if (ret->state == MR_READY) {
+			ret->state = MR_REGISTERED;
+			atomic_dec(&info->mr_ready_count);
+			spin_unlock(&info->mr_list_lock);
+			return ret;
+		}
+	}
+
+	spin_unlock(&info->mr_list_lock);
+	/*
+	 * It is possible that we fail to get a MR because other processes may
+	 * try to acquire a MR at the same time. If this is the case, retry it.
+	 */
+	goto again;
+}
+
+/*
+ * Register memory for RDMA read/write
+ * pages[]: the list of pages to register memory with
+ * num_pages: the number of pages to register
+ * tailsz: if non-zero, the bytes to register in the last page
+ * writing: true if this is a RDMA write (SMB read), false for RDMA read
+ * need_invalidate: true if this MR needs to be locally invalidated after I/O
+ * return value: the MR registered, NULL if failed.
+ */
+struct smbd_mr *smbd_register_mr(
+	struct smbd_connection *info, struct page *pages[], int num_pages,
+	int tailsz, bool writing, bool need_invalidate)
+{
+	struct smbd_mr *smbdirect_mr;
+	int rc, i;
+	enum dma_data_direction dir;
+	struct ib_reg_wr *reg_wr;
+	struct ib_send_wr *bad_wr;
+
+	if (num_pages > info->max_frmr_depth) {
+		log_rdma_mr("num_pages=%d max_frmr_depth=%d\n",
+			num_pages, info->max_frmr_depth);
+		return NULL;
+	}
+
+	smbdirect_mr = get_mr(info);
+	if (!smbdirect_mr) {
+		log_rdma_mr("get_mr returning NULL\n");
+		return NULL;
+	}
+	smbdirect_mr->need_invalidate = need_invalidate;
+	smbdirect_mr->sgl_count = num_pages;
+	sg_init_table(smbdirect_mr->sgl, num_pages);
+
+	for (i = 0; i < num_pages - 1; i++)
+		sg_set_page(&smbdirect_mr->sgl[i], pages[i], PAGE_SIZE, 0);
+	sg_set_page(&smbdirect_mr->sgl[i], pages[i],
+		tailsz ? tailsz : PAGE_SIZE, 0);
+
+	dir = writing ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
+	smbdirect_mr->dir = dir;
+	rc = ib_dma_map_sg(info->id->device, smbdirect_mr->sgl, num_pages, dir);
+	if (!rc) {
+		log_rdma_mr("ib_dma_map_sg num_pages=%x dir=%x rc=%x\n",
+			num_pages, dir, rc);
+		goto dma_map_error;
+	}
+
+	rc = ib_map_mr_sg(smbdirect_mr->mr, smbdirect_mr->sgl, num_pages,
+		NULL, PAGE_SIZE);
+	if (rc != num_pages) {
+		log_rdma_mr("ib_map_mr_sg failed rc = %x num_pages = %x\n",
+			rc, num_pages);
+		goto map_mr_error;
+	}
+
+	ib_update_fast_reg_key(smbdirect_mr->mr,
+		ib_inc_rkey(smbdirect_mr->mr->rkey));
+	reg_wr = &smbdirect_mr->wr;
+	reg_wr->wr.opcode = IB_WR_REG_MR;
+	smbdirect_mr->cqe.done = register_mr_done;
+	reg_wr->wr.wr_cqe = &smbdirect_mr->cqe;
+	reg_wr->wr.num_sge = 0;
+	reg_wr->wr.send_flags = IB_SEND_SIGNALED;
+	reg_wr->mr = smbdirect_mr->mr;
+	reg_wr->key = smbdirect_mr->mr->rkey;
+	reg_wr->access = writing ?
+			IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
+			IB_ACCESS_REMOTE_READ;
+
+	/*
+	 * There is no need for waiting for completion on ib_post_send
+	 * on IB_WR_REG_MR. Hardware enforces a barrier and order of execution
+	 * on the next ib_post_send when we actually send I/O to remote peer
+	 */
+	rc = ib_post_send(info->id->qp, &reg_wr->wr, &bad_wr);
+	if (!rc)
+		return smbdirect_mr;
+
+	log_rdma_mr("ib_post_send failed rc=%x reg_wr->key=%x\n",
+		rc, reg_wr->key);
+
+	/* If all failed, try to recover this MR by setting it to MR_ERROR */
+map_mr_error:
+	ib_dma_unmap_sg(info->id->device, smbdirect_mr->sgl,
+		smbdirect_mr->sgl_count, smbdirect_mr->dir);
+
+dma_map_error:
+	smbdirect_mr->state = MR_ERROR;
+
+	return NULL;
+}
+
+static void local_inv_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct smbd_mr *smbdirect_mr;
+	struct ib_cqe *cqe;
+
+	cqe = wc->wr_cqe;
+	smbdirect_mr = container_of(cqe, struct smbd_mr, cqe);
+	smbdirect_mr->state = MR_INVALIDATED;
+	if (wc->status != IB_WC_SUCCESS) {
+		log_rdma_mr("invalidate failed status=%x\n", wc->status);
+		smbdirect_mr->state = MR_ERROR;
+	}
+	complete(&smbdirect_mr->invalidate_done);
+}
+
+/*
+ * Deregister a MR after I/O is done
+ * This function may wait if remote invalidation is not used
+ * and we have to locally invalidate the buffer to prevent data from being
+ * modified by remote peer after upper layer consumes it
+ */
+int smbd_deregister_mr(struct smbd_mr *smbdirect_mr)
+{
+	struct ib_send_wr *wr, *bad_wr;
+	struct smbd_connection *info = smbdirect_mr->conn;
+	int rc;
+
+	if (info->transport_status != SMBD_CONNECTED)
+		return -ENODEV;
+
+	if (smbdirect_mr->need_invalidate) {
+		/* Need to finish local invalidation before returning */
+		wr = &smbdirect_mr->inv_wr;
+		wr->opcode = IB_WR_LOCAL_INV;
+		smbdirect_mr->cqe.done = local_inv_done;
+		wr->wr_cqe = &smbdirect_mr->cqe;
+		wr->num_sge = 0;
+		wr->ex.invalidate_rkey = smbdirect_mr->mr->rkey;
+		wr->send_flags = IB_SEND_SIGNALED;
+
+		init_completion(&smbdirect_mr->invalidate_done);
+		rc = ib_post_send(info->id->qp, wr, &bad_wr);
+		if (rc) {
+			log_rdma_mr("ib_post_send failed rc=%x\n", rc);
+			rdma_disconnect(info->id);
+			return rc;
+		}
+		wait_for_completion(&smbdirect_mr->invalidate_done);
+	} else
+		/*
+		 * For remote invalidation, just set it to MR_INVALIDATED
+		 * and defer to mr_recovery_work to recover the MR for next use
+		 */
+		smbdirect_mr->state = MR_INVALIDATED;
+
+	/*
+	 * Schedule the work to do MR recovery for future I/Os
+	 * MR recovery is slow and we don't want it to block the current I/O
+	 */
+	schedule_work(&info->mr_recovery_work);
+
+	return 0;
+}
diff --git a/fs/cifs/smbdirect.h b/fs/cifs/smbdirect.h
index 06eeb0b..ec692eb 100644
--- a/fs/cifs/smbdirect.h
+++ b/fs/cifs/smbdirect.h
@@ -16,5 +16,285 @@
 #ifndef _SMBDIRECT_H
 #define _SMBDIRECT_H
 
+#include "cifsglob.h"
+#include <rdma/ib_verbs.h>
+#include <rdma/rdma_cm.h>
+#include <linux/mempool.h>
+
+enum keep_alive_status {
+	KEEP_ALIVE_NONE,
+	KEEP_ALIVE_PENDING,
+	KEEP_ALIVE_SENT,
+};
+
+enum smbd_connection_status {
+	SMBD_CREATED,
+	SMBD_CONNECTING,
+	SMBD_CONNECTED,
+	SMBD_DISCONNECTING,
+	SMBD_DISCONNECTED,
+	SMBD_DESTROYED
+};
+
+/*
+ * The context for the SMBDirect transport
+ * Everything related to the transport is here. It has several logical parts
+ * 1. RDMA related structures
+ * 2. SMBDirect connection parameters
+ * 3. Memory registrations
+ * 4. Receive and reassembly queues for data receive path
+ * 5. mempools for allocating packets
+ */
+struct smbd_connection {
+	struct TCP_Server_Info *server_info;
+	enum smbd_connection_status transport_status;
+
+	/* RDMA related */
+	struct rdma_cm_id *id;
+	struct ib_qp_init_attr qp_attr;
+	struct ib_pd *pd;
+	struct ib_cq *send_cq, *recv_cq;
+	struct ib_device_attr dev_attr;
+	int connect_state;
+	int ri_rc;
+	struct completion ri_done;
+	wait_queue_head_t conn_wait;
+	wait_queue_head_t wait_destroy;
+
+	struct completion negotiate_completion;
+	bool negotiate_done;
+
+	struct work_struct destroy_work;
+	struct work_struct disconnect_work;
+
+	/* Connection parameters defined in [MS-SMBD] 3.1.1.1 */
+	int receive_credit_max;
+	int send_credit_target;
+	int max_send_size;
+	int max_fragmented_recv_size;
+	int max_fragmented_send_size;
+	int max_receive_size;
+	int keep_alive_interval;
+	int max_readwrite_size;
+	enum keep_alive_status keep_alive_requested;
+	int protocol;
+	atomic_t send_credits;
+	atomic_t receive_credits;
+	int receive_credit_target;
+	int fragment_reassembly_remaining;
+
+	/* Memory registrations */
+	/* Maximum number of RDMA read/write outstanding on this connection */
+	int responder_resources;
+	/* Maximum number of SGEs in a RDMA write/read */
+	int max_frmr_depth;
+	/*
+	 * If payload is less than or equal to the threshold,
+	 * use RDMA send/recv to send upper layer I/O.
+	 * If payload is more than the threshold,
+	 * use RDMA read/write through memory registration for I/O.
+	 */
+	int rdma_readwrite_threshold;
+	enum ib_mr_type mr_type;
+	struct list_head mr_list;
+	spinlock_t mr_list_lock;
+	/* The number of available MRs ready for memory registration */
+	atomic_t mr_ready_count;
+	wait_queue_head_t wait_mr;
+	struct work_struct mr_recovery_work;
+
+	/* Activity accounting */
+	atomic_t send_pending;
+	wait_queue_head_t wait_send_pending;
+	atomic_t send_payload_pending;
+	wait_queue_head_t wait_send_payload_pending;
+	atomic_t recv_pending;
+	wait_queue_head_t wait_recv_pending;
+	atomic_t read_pending;
+	wait_queue_head_t wait_read_pending;
+
+	/* Receive queue */
+	struct list_head receive_queue;
+	spinlock_t receive_queue_lock;
+
+	/* Reassembly queue */
+	struct list_head reassembly_queue;
+	spinlock_t reassembly_queue_lock;
+	wait_queue_head_t wait_reassembly_queue;
+
+	/* total data length of reassembly queue */
+	int reassembly_data_length;
+	/* the offset to first buffer in reassembly queue */
+	int first_entry_offset;
+
+	bool send_immediate;
+
+	wait_queue_head_t wait_send_queue;
+
+	/*
+	 * Indicate if we have received a full packet on the connection
+	 * This is used to identify the first SMBD packet of an assembled
+	 * payload (SMB packet) in reassembly queue so we can return a
+	 * RFC1002 length to upper layer to indicate the length of the SMB
+	 * packet received
+	 */
+	bool full_packet_received;
+
+	struct workqueue_struct *workqueue;
+	struct delayed_work idle_timer_work;
+	struct delayed_work send_immediate_work;
+
+	/* Memory pool for preallocating buffers */
+	/* request pool for RDMA send */
+	struct kmem_cache *request_cache;
+	mempool_t *request_mempool;
+
+	/* response pool for RDMA receive */
+	struct kmem_cache *response_cache;
+	mempool_t *response_mempool;
+
+	/* for debug purposes */
+	unsigned int count_receive_buffer;
+	unsigned int count_get_receive_buffer;
+	unsigned int count_put_receive_buffer;
+	unsigned int count_reassembly_queue;
+	unsigned int count_enqueue_reassembly_queue;
+	unsigned int count_dequeue_reassembly_queue;
+	unsigned int count_send_empty;
+};
+
+enum smbd_message_type {
+	SMBD_NEGOTIATE_RESP,
+	SMBD_TRANSFER_DATA,
+};
+
+#define SMB_DIRECT_RESPONSE_REQUESTED 0x0001
+
+/* SMBD negotiation request packet [MS-SMBD] 2.2.1 */
+struct smbd_negotiate_req {
+	__le16 min_version;
+	__le16 max_version;
+	__le16 reserved;
+	__le16 credits_requested;
+	__le32 preferred_send_size;
+	__le32 max_receive_size;
+	__le32 max_fragmented_size;
+} __packed;
+
+/* SMBD negotiation response packet [MS-SMBD] 2.2.2 */
+struct smbd_negotiate_resp {
+	__le16 min_version;
+	__le16 max_version;
+	__le16 negotiated_version;
+	__le16 reserved;
+	__le16 credits_requested;
+	__le16 credits_granted;
+	__le32 status;
+	__le32 max_readwrite_size;
+	__le32 preferred_send_size;
+	__le32 max_receive_size;
+	__le32 max_fragmented_size;
+} __packed;
+
+/* SMBD data transfer packet with payload [MS-SMBD] 2.2.3 */
+struct smbd_data_transfer {
+	__le16 credits_requested;
+	__le16 credits_granted;
+	__le16 flags;
+	__le16 reserved;
+	__le32 remaining_data_length;
+	__le32 data_offset;
+	__le32 data_length;
+	__le32 padding;
+	__u8 buffer[];
+} __packed;
+
+/* The packet fields for a registered RDMA buffer */
+struct smbd_buffer_descriptor_v1 {
+	__le64 offset;
+	__le32 token;
+	__le32 length;
+} __packed;
+
 #define SMBDIRECT_MAX_SGE	16
+/* The context for a SMBD request */
+struct smbd_request {
+	struct smbd_connection *info;
+	struct ib_cqe cqe;
+
+	/* true if this request carries upper layer payload */
+	bool has_payload;
+
+	/* the SGE entries for this packet */
+	struct ib_sge sge[SMBDIRECT_MAX_SGE];
+	int num_sge;
+
+	/* SMBD packet header follows this structure */
+	u8 packet[];
+};
+
+/* The context for a SMBD response */
+struct smbd_response {
+	struct smbd_connection *info;
+	struct ib_cqe cqe;
+	struct ib_sge sge;
+
+	enum smbd_message_type type;
+
+	/* Link to receive queue or reassembly queue */
+	struct list_head list;
+
+	/* Indicate if this is the 1st packet of a payload */
+	bool first_segment;
+
+	/* SMBD packet header and payload follows this structure */
+	u8 packet[];
+};
+
+/* Create a SMBDirect session */
+struct smbd_connection *smbd_get_connection(
+	struct TCP_Server_Info *server, struct sockaddr *dstaddr);
+
+/* Reconnect SMBDirect session */
+int smbd_reconnect(struct TCP_Server_Info *server);
+
+/* Destroy SMBDirect session */
+void smbd_destroy(struct smbd_connection *info);
+
+/* Interface for carrying upper layer I/O through send/recv */
+int smbd_recv(
+	struct smbd_connection *rdma, char *buf, unsigned int to_read);
+int smbd_recv_page(
+	struct smbd_connection *rdma, struct page *page, unsigned int to_read);
+int smbd_send(struct smbd_connection *rdma, struct smb_rqst *rqst);
+
+enum mr_state {
+	MR_READY,
+	MR_REGISTERED,
+	MR_INVALIDATED,
+	MR_ERROR
+};
+
+struct smbd_mr {
+	struct smbd_connection	*conn;
+	struct list_head	list;
+	enum mr_state		state;
+	struct ib_mr		*mr;
+	struct scatterlist	*sgl;
+	int			sgl_count;
+	enum dma_data_direction	dir;
+	union {
+		struct ib_reg_wr	wr;
+		struct ib_send_wr	inv_wr;
+	};
+	struct ib_cqe		cqe;
+	bool			need_invalidate;
+	struct completion	invalidate_done;
+};
+
+/* Interfaces to register and deregister MR for RDMA read/write */
+struct smbd_mr *smbd_register_mr(
+	struct smbd_connection *rdma, struct page *pages[], int num_pages,
+	int tailsz, bool writing, bool need_invalidate);
+int smbd_deregister_mr(struct smbd_mr *mr);
 #endif
-- 
2.7.4
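
The packed wire structures in the header above can be sanity-checked outside the kernel. The following is a minimal userspace sketch, not part of the patch: the `_wire` struct names and `smbd_data_header_len()` are hypothetical, plain `uintN_t` stands in for the kernel's `__le16`/`__le32`/`__le64` types, and `__attribute__((packed))` assumes GCC or Clang. The field order is copied from the header; the sizes are simply the sums of the field widths.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* size_t, offsetof */
#include <stdint.h>

/* Hypothetical userspace mirror of struct smbd_negotiate_req */
struct smbd_negotiate_req_wire {
	uint16_t min_version;
	uint16_t max_version;
	uint16_t reserved;
	uint16_t credits_requested;
	uint32_t preferred_send_size;
	uint32_t max_receive_size;
	uint32_t max_fragmented_size;
} __attribute__((packed));

/* Hypothetical userspace mirror of struct smbd_negotiate_resp */
struct smbd_negotiate_resp_wire {
	uint16_t min_version;
	uint16_t max_version;
	uint16_t negotiated_version;
	uint16_t reserved;
	uint16_t credits_requested;
	uint16_t credits_granted;
	uint32_t status;
	uint32_t max_readwrite_size;
	uint32_t preferred_send_size;
	uint32_t max_receive_size;
	uint32_t max_fragmented_size;
} __attribute__((packed));

/* Hypothetical userspace mirror of struct smbd_data_transfer */
struct smbd_data_transfer_wire {
	uint16_t credits_requested;
	uint16_t credits_granted;
	uint16_t flags;
	uint16_t reserved;
	uint32_t remaining_data_length;
	uint32_t data_offset;
	uint32_t data_length;
	uint32_t padding;
	uint8_t buffer[];	/* payload starts here */
} __attribute__((packed));

/* Hypothetical userspace mirror of struct smbd_buffer_descriptor_v1 */
struct smbd_buffer_descriptor_v1_wire {
	uint64_t offset;
	uint32_t token;
	uint32_t length;
} __attribute__((packed));

/* Sizes follow from adding up the field widths listed above */
static_assert(sizeof(struct smbd_negotiate_req_wire) == 20, "negotiate req");
static_assert(sizeof(struct smbd_negotiate_resp_wire) == 32, "negotiate resp");
static_assert(sizeof(struct smbd_data_transfer_wire) == 24, "data transfer");
static_assert(sizeof(struct smbd_buffer_descriptor_v1_wire) == 16, "descriptor");

/* Offset at which the payload begins inside a data transfer packet */
size_t smbd_data_header_len(void)
{
	return offsetof(struct smbd_data_transfer_wire, buffer);
}
```

A check like this catches accidental padding or field reordering at compile time, before a malformed packet reaches the peer.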

* [Patch v2 04/19] CIFS: SMBD: Add SMBDirect transport to SMB connection and Makefile
  2017-08-20 19:04 [Patch v2 00/19] CIFS: Implement SMBDirect Long Li
  2017-08-20 19:04 ` [Patch v2 03/19] CIFS: SMBD: " Long Li
@ 2017-08-20 19:04 ` Long Li
  2017-08-20 19:04 ` [Patch v2 05/19] CIFS: SMBD: Connect to SMBDirect session Long Li
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-08-20 19:04 UTC (permalink / raw)
  To: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox
  Cc: Long Li

From: Long Li <longli@microsoft.com>

Add SMBDirect as an optional connection to the SMB session defined in CIFS. When the connection is on SMBDirect, the upper layer uses it to carry payloads.

With the transport hooked up, add SMBDirect code to Makefile.

Signed-off-by: Long Li <longli@microsoft.com>
---
 fs/cifs/Makefile   | 2 +-
 fs/cifs/cifsglob.h | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/cifs/Makefile b/fs/cifs/Makefile
index eed7eb0..6bb9863 100644
--- a/fs/cifs/Makefile
+++ b/fs/cifs/Makefile
@@ -18,4 +18,4 @@ cifs-$(CONFIG_CIFS_DFS_UPCALL) += dns_resolve.o cifs_dfs_ref.o
 cifs-$(CONFIG_CIFS_FSCACHE) += fscache.o cache.o
 
 cifs-$(CONFIG_CIFS_SMB2) += smb2ops.o smb2maperror.o smb2transport.o \
-			    smb2misc.o smb2pdu.o smb2inode.o smb2file.o
+			    smb2misc.o smb2pdu.o smb2inode.o smb2file.o smbdirect.o
diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index 703c2fb..dc5404d 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -652,6 +652,8 @@ struct TCP_Server_Info {
 	bool	large_buf;		/* is current buffer large? */
 	/* use SMBD connection instead of socket */
 	bool	rdma;
+	/* point to the SMBD connection if RDMA is used instead of socket */
+	struct smbd_connection *smbd_conn;
 	struct delayed_work	echo; /* echo ping workqueue job */
 	char	*smallbuf;	/* pointer to current "small" buffer */
 	char	*bigbuf;	/* pointer to current "big" buffer */
-- 
2.7.4

* [Patch v2 05/19] CIFS: SMBD: Connect to SMBDirect session
  2017-08-20 19:04 [Patch v2 00/19] CIFS: Implement SMBDirect Long Li
  2017-08-20 19:04 ` [Patch v2 03/19] CIFS: SMBD: " Long Li
  2017-08-20 19:04 ` [Patch v2 04/19] CIFS: SMBD: Add SMBDirect transport to SMB connection and Makefile Long Li
@ 2017-08-20 19:04 ` Long Li
  2017-08-20 19:04 ` [Patch v2 06/19] CIFS: SMBD: Reconnect " Long Li
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-08-20 19:04 UTC (permalink / raw)
  To: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox
  Cc: Long Li

From: Long Li <longli@microsoft.com>

When "rdma" is specified as a mount option, CIFS attempts to connect over SMBDirect instead of a TCP socket.

Signed-off-by: Long Li <longli@microsoft.com>
---
 fs/cifs/connect.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index d5d0ecd..309eba0 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -45,6 +45,7 @@
 #include <linux/parser.h>
 #include <linux/bvec.h>
 
+#include "smbdirect.h"
 #include "cifspdu.h"
 #include "cifsglob.h"
 #include "cifsproto.h"
@@ -2299,12 +2300,26 @@ cifs_get_tcp_session(struct smb_vol *volume_info)
 	else
 		tcp_ses->echo_interval = SMB_ECHO_INTERVAL_DEFAULT * HZ;
 
+	if (tcp_ses->rdma) {
+		tcp_ses->smbd_conn = smbd_get_connection(
+			tcp_ses, (struct sockaddr *)&volume_info->dstaddr);
+		if (tcp_ses->smbd_conn) {
+			cifs_dbg(VFS, "RDMA transport established\n");
+			rc = 0;
+			goto connected;
+		} else {
+			rc = -ENOENT;
+			goto out_err_crypto_release;
+		}
+	}
+
 	rc = ip_connect(tcp_ses);
 	if (rc < 0) {
 		cifs_dbg(VFS, "Error connecting to socket. Aborting operation.\n");
 		goto out_err_crypto_release;
 	}
 
+connected:
 	/*
 	 * since we're in a cifs function already, we know that
 	 * this will succeed. No need for try_module_get().
-- 
2.7.4

* [Patch v2 06/19] CIFS: SMBD: Reconnect to SMBDirect session
  2017-08-20 19:04 [Patch v2 00/19] CIFS: Implement SMBDirect Long Li
                   ` (2 preceding siblings ...)
  2017-08-20 19:04 ` [Patch v2 05/19] CIFS: SMBD: Connect to SMBDirect session Long Li
@ 2017-08-20 19:04 ` Long Li
  2017-08-20 19:04 ` [Patch v2 07/19] CIFS: SMBD: Destroy SMBDirect session on shutdown or umount Long Li
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-08-20 19:04 UTC (permalink / raw)
  To: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox
  Cc: Long Li

From: Long Li <longli@microsoft.com>

Do a reconnect on SMBDirect when it is used as the connection. Reconnect can happen for many reasons, and the decision is mostly made by the upper layer (SMB2), not by SMBDirect.

Signed-off-by: Long Li <longli@microsoft.com>
---
 fs/cifs/connect.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 309eba0..b337ca7 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -409,7 +409,11 @@ cifs_reconnect(struct TCP_Server_Info *server)
 
 		/* we should try only the port we connected to before */
 		mutex_lock(&server->srv_mutex);
-		rc = generic_ip_connect(server);
+		if (server->rdma)
+			rc = smbd_reconnect(server);
+		else
+			rc = generic_ip_connect(server);
+
 		if (rc) {
 			cifs_dbg(FYI, "reconnect error %d\n", rc);
 			mutex_unlock(&server->srv_mutex);
-- 
2.7.4

* [Patch v2 07/19] CIFS: SMBD: Destroy SMBDirect session on shutdown or umount
  2017-08-20 19:04 [Patch v2 00/19] CIFS: Implement SMBDirect Long Li
                   ` (3 preceding siblings ...)
  2017-08-20 19:04 ` [Patch v2 06/19] CIFS: SMBD: Reconnect " Long Li
@ 2017-08-20 19:04 ` Long Li
  2017-08-20 19:04 ` [Patch v2 08/19] CIFS: SMBD: Set SMBDirect maximum read or write size for I/O Long Li
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-08-20 19:04 UTC (permalink / raw)
  To: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox
  Cc: Long Li

From: Long Li <longli@microsoft.com>

Finally, when CIFS wants to umount, do a proper shutdown of the transport.

Signed-off-by: Long Li <longli@microsoft.com>
---
 fs/cifs/connect.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index b337ca7..f65950f 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -708,6 +708,11 @@ static void clean_demultiplex_info(struct TCP_Server_Info *server)
 	/* give those requests time to exit */
 	msleep(125);
 
+	if (server->smbd_conn) {
+		smbd_destroy(server->smbd_conn);
+		server->smbd_conn = NULL;
+	}
+
 	if (server->ssocket) {
 		sock_release(server->ssocket);
 		server->ssocket = NULL;
@@ -2194,6 +2199,9 @@ cifs_put_tcp_session(struct TCP_Server_Info *server, int from_reconnect)
 		return;
 	}
 
+	if (server->smbd_conn)
+		smbd_destroy(server->smbd_conn);
+
 	put_net(cifs_net_ns(server));
 
 	list_del_init(&server->tcp_ses_list);
-- 
2.7.4

* [Patch v2 08/19] CIFS: SMBD: Set SMBDirect maximum read or write size for I/O
  2017-08-20 19:04 [Patch v2 00/19] CIFS: Implement SMBDirect Long Li
                   ` (4 preceding siblings ...)
  2017-08-20 19:04 ` [Patch v2 07/19] CIFS: SMBD: Destroy SMBDirect session on shutdown or umount Long Li
@ 2017-08-20 19:04 ` Long Li
  2017-08-20 19:04 ` [Patch v2 09/19] CIFS: SMBD: Read data from SMBDirect Long Li
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-08-20 19:04 UTC (permalink / raw)
  To: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox
  Cc: Long Li

From: Long Li <longli@microsoft.com>

When connecting over SMBDirect, the transport negotiates its maximum I/O sizes with the server and decides when to use RDMA send/recv versus RDMA read/write. Expose these maximum I/O sizes to the upper layer so that it issues correctly sized payloads.

Signed-off-by: Long Li <longli@microsoft.com>
---
 fs/cifs/smb2ops.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c
index 06494e1..e67f5f0 100644
--- a/fs/cifs/smb2ops.c
+++ b/fs/cifs/smb2ops.c
@@ -32,6 +32,7 @@
 #include "smb2status.h"
 #include "smb2glob.h"
 #include "cifs_ioctl.h"
+#include "smbdirect.h"
 
 static int
 change_conf(struct TCP_Server_Info *server)
@@ -249,7 +250,11 @@ smb2_negotiate_wsize(struct cifs_tcon *tcon, struct smb_vol *volume_info)
 
 	/* start with specified wsize, or default */
 	wsize = volume_info->wsize ? volume_info->wsize : CIFS_DEFAULT_IOSIZE;
-	wsize = min_t(unsigned int, wsize, server->max_write);
+	if (server->rdma)
+		wsize = min_t(unsigned int,
+				wsize, server->smbd_conn->max_readwrite_size);
+	else
+		wsize = min_t(unsigned int, wsize, server->max_write);
 
 	if (!(server->capabilities & SMB2_GLOBAL_CAP_LARGE_MTU))
 		wsize = min_t(unsigned int, wsize, SMB2_MAX_BUFFER_SIZE);
@@ -265,7 +270,11 @@ smb2_negotiate_rsize(struct cifs_tcon *tcon, struct smb_vol *volume_info)
 
 	/* start with specified rsize, or default */
 	rsize = volume_info->rsize ? volume_info->rsize : CIFS_DEFAULT_IOSIZE;
-	rsize = min_t(unsigned int, rsize, server->max_read);
+	if (server->rdma)
+		rsize = min_t(unsigned int,
+				rsize, server->smbd_conn->max_readwrite_size);
+	else
+		rsize = min_t(unsigned int, rsize, server->max_read);
 
 	if (!(server->capabilities & SMB2_GLOBAL_CAP_LARGE_MTU))
 		rsize = min_t(unsigned int, rsize, SMB2_MAX_BUFFER_SIZE);
-- 
2.7.4
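
The size selection in the two hunks above can be sketched in isolation. This is a minimal userspace illustration, not the kernel code: `struct io_limits` and `negotiate_io_size()` are hypothetical names, and all numeric limits are passed in by the caller instead of being taken from `CIFS_DEFAULT_IOSIZE` or `SMB2_MAX_BUFFER_SIZE`.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the fields the patch reads from the server */
struct io_limits {
	bool rdma;				/* connection uses SMBDirect */
	bool large_mtu;				/* server advertises large MTU */
	unsigned int server_max;		/* max_read or max_write */
	unsigned int smbd_max_readwrite;	/* max_readwrite_size */
};

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Mirror of the rsize/wsize logic: start from the requested size (or
 * the default when none was given), clamp to the transport limit that
 * applies, then clamp again when large MTU is not supported.
 */
unsigned int negotiate_io_size(const struct io_limits *lim,
			       unsigned int requested,
			       unsigned int default_size,
			       unsigned int max_buffer)
{
	unsigned int size = requested ? requested : default_size;

	if (lim->rdma)
		size = min_uint(size, lim->smbd_max_readwrite);
	else
		size = min_uint(size, lim->server_max);

	if (!lim->large_mtu)
		size = min_uint(size, max_buffer);

	return size;
}
```

The only behavioral change the patch makes is which limit feeds the first clamp; the large-MTU clamp is shared by both transports.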

* [Patch v2 09/19] CIFS: SMBD: Read data from SMBDirect
  2017-08-20 19:04 [Patch v2 00/19] CIFS: Implement SMBDirect Long Li
                   ` (5 preceding siblings ...)
  2017-08-20 19:04 ` [Patch v2 08/19] CIFS: SMBD: Set SMBDirect maximum read or write size for I/O Long Li
@ 2017-08-20 19:04 ` Long Li
  2017-08-20 19:04 ` [Patch v2 10/19] CIFS: SMBD: Send data through SMBDirect Long Li
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-08-20 19:04 UTC (permalink / raw)
  To: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox
  Cc: Long Li

From: Long Li <longli@microsoft.com>

With SMBDirect connected, use it for receiving data via RDMA recv.

Signed-off-by: Long Li <longli@microsoft.com>
---
 fs/cifs/connect.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index f65950f..9b0da7f 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -584,6 +584,10 @@ cifs_read_from_socket(struct TCP_Server_Info *server, char *buf,
 {
 	struct msghdr smb_msg;
 	struct kvec iov = {.iov_base = buf, .iov_len = to_read};
+
+	if (server->smbd_conn)
+		return smbd_recv(server->smbd_conn, buf, to_read);
+
 	iov_iter_kvec(&smb_msg.msg_iter, READ | ITER_KVEC, &iov, 1, to_read);
 
 	return cifs_readv_from_socket(server, &smb_msg);
@@ -595,6 +599,10 @@ cifs_read_page_from_socket(struct TCP_Server_Info *server, struct page *page,
 {
 	struct msghdr smb_msg;
 	struct bio_vec bv = {.bv_page = page, .bv_len = to_read};
+
+	if (server->smbd_conn)
+		return smbd_recv_page(server->smbd_conn, page, to_read);
+
 	iov_iter_bvec(&smb_msg.msg_iter, READ | ITER_BVEC, &bv, 1, to_read);
 	return cifs_readv_from_socket(server, &smb_msg);
 }
-- 
2.7.4

* [Patch v2 10/19] CIFS: SMBD: Send data through SMBDirect
  2017-08-20 19:04 [Patch v2 00/19] CIFS: Implement SMBDirect Long Li
                   ` (6 preceding siblings ...)
  2017-08-20 19:04 ` [Patch v2 09/19] CIFS: SMBD: Read data from SMBDirect Long Li
@ 2017-08-20 19:04 ` Long Li
  2017-08-20 19:04 ` [Patch v2 11/19] CIFS: SMBD: Define memory registration for I/O data Long Li
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-08-20 19:04 UTC (permalink / raw)
  To: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox
  Cc: Long Li

From: Long Li <longli@microsoft.com>

With SMBDirect connected, use it for sending data via RDMA send.

Signed-off-by: Long Li <longli@microsoft.com>
---
 fs/cifs/transport.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/cifs/transport.c b/fs/cifs/transport.c
index ba62aaf..bddb699 100644
--- a/fs/cifs/transport.c
+++ b/fs/cifs/transport.c
@@ -37,6 +37,7 @@
 #include "cifsglob.h"
 #include "cifsproto.h"
 #include "cifs_debug.h"
+#include "smbdirect.h"
 
 void
 cifs_wake_up_task(struct mid_q_entry *mid)
@@ -230,6 +231,11 @@ __smb_send_rqst(struct TCP_Server_Info *server, struct smb_rqst *rqst)
 	struct msghdr smb_msg;
 	int val = 1;
 
+	if (server->smbd_conn) {
+		rc = smbd_send(server->smbd_conn, rqst);
+		goto done;
+	}
+
 	if (ssocket == NULL)
 		return -ENOTSOCK;
 
@@ -299,6 +305,7 @@ __smb_send_rqst(struct TCP_Server_Info *server, struct smb_rqst *rqst)
 		server->tcpStatus = CifsNeedReconnect;
 	}
 
+done:
 	if (rc < 0 && rc != -EINTR)
 		cifs_dbg(VFS, "Error %d sending data on socket to server\n",
 			 rc);
-- 
2.7.4

* [Patch v2 11/19] CIFS: SMBD: Define memory registration for I/O data
  2017-08-20 19:04 [Patch v2 00/19] CIFS: Implement SMBDirect Long Li
                   ` (7 preceding siblings ...)
  2017-08-20 19:04 ` [Patch v2 10/19] CIFS: SMBD: Send data through SMBDirect Long Li
@ 2017-08-20 19:04 ` Long Li
  2017-08-20 19:04 ` [Patch v2 12/19] CIFS: SMBD: Fix the definition for SMB2_CHANNEL_RDMA_V1_INVALIDATE Long Li
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-08-20 19:04 UTC (permalink / raw)
  To: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox
  Cc: Long Li

From: Long Li <longli@microsoft.com>

To prepare for RDMA read/write using memory registration, add memory registration pointers to the upper layer data I/O context.

Signed-off-by: Long Li <longli@microsoft.com>
---
 fs/cifs/cifsglob.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index dc5404d..dcd2b63 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -1166,6 +1166,7 @@ struct cifs_readdata {
 				struct cifs_readdata *rdata,
 				struct iov_iter *iter);
 	struct kvec			iov[2];
+	struct smbd_mr			*mr;
 	unsigned int			pagesz;
 	unsigned int			tailsz;
 	unsigned int			credits;
@@ -1188,6 +1189,7 @@ struct cifs_writedata {
 	pid_t				pid;
 	unsigned int			bytes;
 	int				result;
+	struct smbd_mr			*mr;
 	unsigned int			pagesz;
 	unsigned int			tailsz;
 	unsigned int			credits;
-- 
2.7.4

* [Patch v2 12/19] CIFS: SMBD: Fix the definition for SMB2_CHANNEL_RDMA_V1_INVALIDATE
  2017-08-20 19:04 [Patch v2 00/19] CIFS: Implement SMBDirect Long Li
                   ` (8 preceding siblings ...)
  2017-08-20 19:04 ` [Patch v2 11/19] CIFS: SMBD: Define memory registration for I/O data Long Li
@ 2017-08-20 19:04 ` Long Li
  2017-08-20 19:04 ` [Patch v2 13/19] CIFS: SMBD: Use registered memory RDMA read for SMB write Long Li
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-08-20 19:04 UTC (permalink / raw)
  To: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox
  Cc: Long Li

From: Long Li <longli@microsoft.com>

The channel value that asks the server to remotely invalidate the local memory registration should be 0x00000002.

Signed-off-by: Long Li <longli@microsoft.com>
---
 fs/cifs/smb2pdu.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/cifs/smb2pdu.h b/fs/cifs/smb2pdu.h
index 18700fd..0417a36 100644
--- a/fs/cifs/smb2pdu.h
+++ b/fs/cifs/smb2pdu.h
@@ -832,7 +832,7 @@ struct smb2_flush_rsp {
 /* Channel field for read and write: exactly one of following flags can be set*/
 #define SMB2_CHANNEL_NONE		0x00000000
 #define SMB2_CHANNEL_RDMA_V1		0x00000001 /* SMB3 or later */
-#define SMB2_CHANNEL_RDMA_V1_INVALIDATE 0x00000001 /* SMB3.02 or later */
+#define SMB2_CHANNEL_RDMA_V1_INVALIDATE 0x00000002 /* SMB3.02 or later */
 
 /* SMB2 read request without RFC1001 length at the beginning */
 struct smb2_read_plain_req {
-- 
2.7.4

* [Patch v2 13/19] CIFS: SMBD: Use registered memory RDMA read for SMB write
  2017-08-20 19:04 [Patch v2 00/19] CIFS: Implement SMBDirect Long Li
                   ` (9 preceding siblings ...)
  2017-08-20 19:04 ` [Patch v2 12/19] CIFS: SMBD: Fix the definition for SMB2_CHANNEL_RDMA_V1_INVALIDATE Long Li
@ 2017-08-20 19:04 ` Long Li
  2017-08-23 13:52   ` Leon Romanovsky
  2017-08-20 19:04 ` [Patch v2 14/19] CIFS: SMBD: Deregister memory when finishing " Long Li
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 53+ messages in thread
From: Long Li @ 2017-08-20 19:04 UTC (permalink / raw)
  To: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox
  Cc: Long Li

From: Long Li <longli@microsoft.com>

When sending I/O, if the size is larger than rdma_readwrite_threshold, prepare the SMB WRITE packet for an RDMA read via memory registration. The actual I/O is done out of band, so modify the relevant fields in the packet accordingly.

Signed-off-by: Long Li <longli@microsoft.com>
---
 fs/cifs/smb2pdu.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)

diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index 5cc5f6c..5581afd 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -48,6 +48,7 @@
 #include "smb2glob.h"
 #include "cifspdu.h"
 #include "cifs_spnego.h"
+#include "smbdirect.h"
 
 /*
  *  The following table defines the expected "StructureSize" of SMB2 requests
@@ -2716,6 +2717,41 @@ smb2_async_writev(struct cifs_writedata *wdata,
 				offsetof(struct smb2_write_req, Buffer) - 4);
 	req->RemainingBytes = 0;
 
+	/*
+	 * If we want to do a server RDMA read, fill in and append
+	 * smbd_buffer_descriptor_v1 to the end of write request
+	 */
+	if (server->rdma && wdata->bytes >
+		server->smbd_conn->rdma_readwrite_threshold) {
+
+		struct smbd_buffer_descriptor_v1 *v1;
+		bool need_invalidate = server->dialect == SMB30_PROT_ID;
+
+		wdata->mr = smbd_register_mr(
+				server->smbd_conn, wdata->pages,
+				wdata->nr_pages, wdata->tailsz,
+				false, need_invalidate);
+		if (!wdata->mr) {
+			rc = -ENOBUFS;
+			goto async_writev_out;
+		}
+		req->Length = 0;
+		req->DataOffset = 0;
+		req->RemainingBytes =
+			(wdata->nr_pages-1)*PAGE_SIZE + wdata->tailsz;
+		req->Channel = SMB2_CHANNEL_RDMA_V1_INVALIDATE;
+		if (need_invalidate)
+			req->Channel = SMB2_CHANNEL_RDMA_V1;
+		req->WriteChannelInfoOffset =
+			offsetof(struct smb2_write_req, Buffer) - 4;
+		req->WriteChannelInfoLength =
+			sizeof(struct smbd_buffer_descriptor_v1);
+		v1 = (struct smbd_buffer_descriptor_v1 *) &req->Buffer[0];
+		v1->offset = cpu_to_le64(wdata->mr->mr->iova);
+		v1->token = cpu_to_le32(wdata->mr->mr->rkey);
+		v1->length = cpu_to_le32(wdata->mr->mr->length);
+	}
+
 	/* 4 for rfc1002 length field and 1 for Buffer */
 	iov[0].iov_len = 4;
 	iov[0].iov_base = req;
@@ -2729,10 +2765,17 @@ smb2_async_writev(struct cifs_writedata *wdata,
 	rqst.rq_pagesz = wdata->pagesz;
 	rqst.rq_tailsz = wdata->tailsz;
 
+	if (wdata->mr) {
+		iov[1].iov_len += sizeof(struct smbd_buffer_descriptor_v1);
+		rqst.rq_npages = 0;
+	}
+
 	cifs_dbg(FYI, "async write at %llu %u bytes\n",
 		 wdata->offset, wdata->bytes);
 
-	req->Length = cpu_to_le32(wdata->bytes);
+	/* For RDMA read, I/O size is in RemainingBytes not in Length */
+	if (!wdata->mr)
+		req->Length = cpu_to_le32(wdata->bytes);
 
 	inc_rfc1001_len(&req->hdr, wdata->bytes - 1 /* Buffer */);
 
-- 
2.7.4
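
The descriptor that the patch above appends to the write request has a fixed 16-byte little-endian layout (offset, token, length). A minimal userspace sketch of that encoding follows, under stated assumptions: the helper names (`put_le32`, `put_le64`, `encode_descriptor_v1`, `choose_channel`, `rdma_payload_bytes`) are hypothetical, the bytes are written one at a time so the result does not depend on host endianness, and the two channel values are the ones given in patch 12 of this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Channel values as defined in patch 12 of this series */
#define CHANNEL_RDMA_V1			0x00000001u
#define CHANNEL_RDMA_V1_INVALIDATE	0x00000002u

/* offset (8 bytes) + token (4 bytes) + length (4 bytes) */
#define DESC_V1_LEN 16

/* Store a 32-bit value least significant byte first */
static void put_le32(uint8_t *p, uint32_t v)
{
	for (int i = 0; i < 4; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}

/* Store a 64-bit value least significant byte first */
static void put_le64(uint8_t *p, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}

/* Serialize one buffer descriptor in wire (little-endian) order */
void encode_descriptor_v1(uint8_t out[DESC_V1_LEN], uint64_t offset,
			  uint32_t token, uint32_t length)
{
	put_le64(out, offset);
	put_le32(out + 8, token);
	put_le32(out + 12, length);
}

/*
 * Same choice as the patch: when the client must invalidate locally,
 * use the plain RDMA channel; otherwise ask for remote invalidation.
 */
uint32_t choose_channel(bool need_local_invalidate)
{
	return need_local_invalidate ? CHANNEL_RDMA_V1
				     : CHANNEL_RDMA_V1_INVALIDATE;
}

/* Bytes covered by the registration; the caller ensures nr_pages >= 1 */
uint32_t rdma_payload_bytes(uint32_t nr_pages, uint32_t page_size,
			    uint32_t tailsz)
{
	return (nr_pages - 1) * page_size + tailsz;
}
```

In the kernel the same byte order is obtained with `cpu_to_le64()` and `cpu_to_le32()`; the byte-wise helpers here only make the wire layout explicit.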

* [Patch v2 14/19] CIFS: SMBD: Deregister memory when finishing SMB write
  2017-08-20 19:04 [Patch v2 00/19] CIFS: Implement SMBDirect Long Li
                   ` (10 preceding siblings ...)
  2017-08-20 19:04 ` [Patch v2 13/19] CIFS: SMBD: Use registered memory RDMA read for SMB write Long Li
@ 2017-08-20 19:04 ` Long Li
  2017-08-20 19:04 ` [Patch v2 15/19] CIFS: SMBD: Add parameter rdata to smb2_new_read_req Long Li
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-08-20 19:04 UTC (permalink / raw)
  To: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox
  Cc: Long Li

From: Long Li <longli@microsoft.com>

On write I/O completion, deregister the memory region if it was used for an RDMA read. The call to smbd_deregister_mr does a local invalidation, and may wait for it, if remote invalidation is not used.

Signed-off-by: Long Li <longli@microsoft.com>
---
 fs/cifs/smb2pdu.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index 5581afd..5551053 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -2666,6 +2666,18 @@ smb2_writev_callback(struct mid_q_entry *mid)
 		break;
 	}
 
+	/*
+	 * If this wdata has a memory region registered, the MR can be freed.
+	 * The number of available MRs is limited, so it is important to
+	 * recover a used MR as soon as the I/O is finished. Holding the MR
+	 * longer in the later I/O process can result in an I/O deadlock,
+	 * because no MR is left to send the request on I/O retry.
+	 */
+	if (wdata->mr) {
+		smbd_deregister_mr(wdata->mr);
+		wdata->mr = NULL;
+	}
+
 	if (wdata->result)
 		cifs_stats_fail_inc(tcon, SMB2_WRITE_HE);
 
-- 
2.7.4
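
The comment in the patch above argues that MRs must be returned promptly because the pool is small. A toy, single-threaded sketch of that reasoning follows. It is not the series' implementation, which keeps MRs on `mr_list` and blocks on `wait_mr`; here `struct mr_pool`, `pool_get()`, `pool_put()`, and the pool size of 2 are hypothetical, and exhaustion is reported by returning NULL instead of sleeping.

```c
#include <stddef.h>

#define POOL_SIZE 2	/* illustrative; the real limit depends on the connection */

enum toy_mr_state {
	TOY_MR_READY,		/* available for registration */
	TOY_MR_REGISTERED,	/* in use by an outstanding I/O */
};

struct toy_mr {
	enum toy_mr_state state;
};

/* Zero-initialized pool: every slot starts in TOY_MR_READY */
struct mr_pool {
	struct toy_mr slots[POOL_SIZE];
};

/* Take a ready MR, or return NULL when every MR is still held */
struct toy_mr *pool_get(struct mr_pool *pool)
{
	for (size_t i = 0; i < POOL_SIZE; i++) {
		if (pool->slots[i].state == TOY_MR_READY) {
			pool->slots[i].state = TOY_MR_REGISTERED;
			return &pool->slots[i];
		}
	}
	return NULL;
}

/* Return an MR to the pool once its I/O has completed */
void pool_put(struct toy_mr *mr)
{
	mr->state = TOY_MR_READY;
}
```

If a completed write kept its MR until some later cleanup step, a retry that needs a fresh MR would find the pool empty. That is the deadlock the comment warns about, and it is why the callback releases the MR before any further processing.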

* [Patch v2 15/19] CIFS: SMBD: Add parameter rdata to smb2_new_read_req
  2017-08-20 19:04 [Patch v2 00/19] CIFS: Implement SMBDirect Long Li
                   ` (11 preceding siblings ...)
  2017-08-20 19:04 ` [Patch v2 14/19] CIFS: SMBD: Deregister memory when finishing " Long Li
@ 2017-08-20 19:04 ` Long Li
  2017-08-20 19:04 ` [Patch v2 16/19] CIFS: SMBD: Read correct returned data length for RDMA write (SMB READ) I/O Long Li
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-08-20 19:04 UTC (permalink / raw)
  To: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox
  Cc: Long Li

From: Long Li <longli@microsoft.com>

When we assemble the SMB READ packet header, we need to know the I/O layout if this request is to use an RDMA write. rdata has all the information we need for memory registration. Add rdata to smb2_new_read_req.

Signed-off-by: Long Li <longli@microsoft.com>
---
 fs/cifs/smb2pdu.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index 5551053..fbad987 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -2363,18 +2363,21 @@ SMB2_flush(const unsigned int xid, struct cifs_tcon *tcon, u64 persistent_fid,
  */
 static int
 smb2_new_read_req(void **buf, unsigned int *total_len,
-		  struct cifs_io_parms *io_parms, unsigned int remaining_bytes,
-		  int request_type)
+	struct cifs_io_parms *io_parms, struct cifs_readdata *rdata,
+	unsigned int remaining_bytes, int request_type)
 {
 	int rc = -EACCES;
 	struct smb2_read_plain_req *req = NULL;
 	struct smb2_sync_hdr *shdr;
+	struct TCP_Server_Info *server;
 
 	rc = smb2_plain_req_init(SMB2_READ, io_parms->tcon, (void **) &req,
 				 total_len);
 	if (rc)
 		return rc;
-	if (io_parms->tcon->ses->server == NULL)
+
+	server = io_parms->tcon->ses->server;
+	if (server == NULL)
 		return -ECONNABORTED;
 
 	shdr = &req->sync_hdr;
@@ -2502,7 +2505,8 @@ smb2_async_readv(struct cifs_readdata *rdata)
 
 	server = io_parms.tcon->ses->server;
 
-	rc = smb2_new_read_req((void **) &buf, &total_len, &io_parms, 0, 0);
+	rc = smb2_new_read_req(
+		(void **) &buf, &total_len, &io_parms, rdata, 0, 0);
 	if (rc) {
 		if (rc == -EAGAIN && rdata->credits) {
 			/* credits was reset by reconnect */
@@ -2570,7 +2574,7 @@ SMB2_read(const unsigned int xid, struct cifs_io_parms *io_parms,
 	struct cifs_ses *ses = io_parms->tcon->ses;
 
 	*nbytes = 0;
-	rc = smb2_new_read_req((void **)&req, &total_len, io_parms, 0, 0);
+	rc = smb2_new_read_req((void **)&req, &total_len, io_parms, NULL, 0, 0);
 	if (rc)
 		return rc;
 
-- 
2.7.4

* [Patch v2 16/19] CIFS: SMBD: Read correct returned data length for RDMA write (SMB READ) I/O
  2017-08-20 19:04 [Patch v2 00/19] CIFS: Implement SMBDirect Long Li
                   ` (12 preceding siblings ...)
  2017-08-20 19:04 ` [Patch v2 15/19] CIFS: SMBD: Add parameter rdata to smb2_new_read_req Long Li
@ 2017-08-20 19:04 ` Long Li
  2017-08-20 19:04 ` [Patch v2 17/19] CIFS: SMBD: Implement SMB READ via RDMA write through memory registration Long Li
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-08-20 19:04 UTC (permalink / raw)
  To: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox
  Cc: Long Li

From: Long Li <longli@microsoft.com>

When RDMA write is used for SMB READ, the length of the returned data is reported in DataRemaining in the response packet. Read it properly by adding a parameter that specifies which field holds the returned data length.

Signed-off-by: Long Li <longli@microsoft.com>
---
 fs/cifs/cifsglob.h | 10 ++++++++--
 fs/cifs/cifssmb.c  |  4 ++--
 fs/cifs/smb1ops.c  |  2 +-
 fs/cifs/smb2ops.c  |  8 ++++++--
 4 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index dcd2b63..d391767 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -231,8 +231,14 @@ struct smb_version_operations {
 	__u64 (*get_next_mid)(struct TCP_Server_Info *);
 	/* data offset from read response message */
 	unsigned int (*read_data_offset)(char *);
-	/* data length from read response message */
-	unsigned int (*read_data_length)(char *);
+	/*
+	 * Data length from read response message
+	 * When in_remaining is true, the returned data length is in
+	 * message field DataRemaining for out-of-band data read (e.g. through
+	 * Memory Registration RDMA write in SMBD).
+	 * Otherwise, the returned data length is in message field DataLength.
+	 */
+	unsigned int (*read_data_length)(char *, bool in_remaining);
 	/* map smb to linux error */
 	int (*map_error)(char *, bool);
 	/* find mid corresponding to the response message */
diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c
index fbb0d4c..9030fb5 100644
--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -1523,8 +1523,8 @@ cifs_readv_receive(struct TCP_Server_Info *server, struct mid_q_entry *mid)
 		 rdata->iov[0].iov_base, server->total_read);
 
 	/* how much data is in the response? */
-	data_len = server->ops->read_data_length(buf);
-	if (data_offset + data_len > buflen) {
+	data_len = server->ops->read_data_length(buf, rdata->mr);
+	if (!rdata->mr && (data_offset + data_len > buflen)) {
 		/* data_len is corrupt -- discard frame */
 		rdata->result = -EIO;
 		return cifs_readv_discard(server, mid);
diff --git a/fs/cifs/smb1ops.c b/fs/cifs/smb1ops.c
index a723df3..27a8280 100644
--- a/fs/cifs/smb1ops.c
+++ b/fs/cifs/smb1ops.c
@@ -87,7 +87,7 @@ cifs_read_data_offset(char *buf)
 }
 
 static unsigned int
-cifs_read_data_length(char *buf)
+cifs_read_data_length(char *buf, bool in_remaining)
 {
 	READ_RSP *rsp = (READ_RSP *)buf;
 	return (le16_to_cpu(rsp->DataLengthHigh) << 16) +
diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c
index e67f5f0..4067629 100644
--- a/fs/cifs/smb2ops.c
+++ b/fs/cifs/smb2ops.c
@@ -747,9 +747,13 @@ smb2_read_data_offset(char *buf)
 }
 
 static unsigned int
-smb2_read_data_length(char *buf)
+smb2_read_data_length(char *buf, bool in_remaining)
 {
 	struct smb2_read_rsp *rsp = (struct smb2_read_rsp *)buf;
+
+	if (in_remaining)
+		return le32_to_cpu(rsp->DataRemaining);
+
 	return le32_to_cpu(rsp->DataLength);
 }
 
@@ -2181,7 +2185,7 @@ handle_read_data(struct TCP_Server_Info *server, struct mid_q_entry *mid,
 	}
 
 	data_offset = server->ops->read_data_offset(buf) + 4;
-	data_len = server->ops->read_data_length(buf);
+	data_len = server->ops->read_data_length(buf, rdata->mr);
 
 	if (data_offset < server->vals->read_rsp_size) {
 		/*
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 53+ messages in thread
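
The length selection added to smb2_read_data_length() in patch 16 can be sketched in userspace as follows. This is only an illustration under stated assumptions: `struct read_rsp_sketch`, `get_le32()` and `read_data_length_sketch()` are hypothetical names, not the kernel's definitions; only the field names DataLength and DataRemaining come from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the two length fields of an SMB2 READ
 * response. Both are kept as raw little-endian bytes, as on the wire.
 */
struct read_rsp_sketch {
	uint8_t DataLength[4];
	uint8_t DataRemaining[4];
};

/* Decode a 32-bit little-endian value regardless of host byte order. */
static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Same selection as the patched smb2_read_data_length(): when the
 * payload was delivered out of band, the length is reported in
 * DataRemaining; otherwise it is in DataLength.
 */
static uint32_t read_data_length_sketch(const struct read_rsp_sketch *rsp,
					bool in_remaining)
{
	if (in_remaining)
		return get_le32(rsp->DataRemaining);
	return get_le32(rsp->DataLength);
}
```

In the patch, the callers pass rdata->mr as the second argument, so the flag is true exactly when a memory registration exists for the read.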

* [Patch v2 17/19] CIFS: SMBD: Implement SMB READ via RDMA write through memory registration
  2017-08-20 19:04 [Patch v2 00/19] CIFS: Implement SMBDirect Long Li
                   ` (13 preceding siblings ...)
  2017-08-20 19:04 ` [Patch v2 16/19] CIFS: SMBD: Read correct returned data length for RDMA write (SMB READ) I/O Long Li
@ 2017-08-20 19:04 ` Long Li
  2017-08-20 19:04 ` [Patch v2 18/19] CIFS: SMBD: Deregister memory when finishing SMB READ Long Li
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-08-20 19:04 UTC (permalink / raw)
  To: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox
  Cc: Long Li

From: Long Li <longli@microsoft.com>

If I/O size is larger than rdma_readwrite_threshold, use RDMA write for SMB READ by specifying channel SMB2_CHANNEL_RDMA_V1 or SMB2_CHANNEL_RDMA_V1_INVALIDATE, depending on the SMB dialect used. When RDMA write is used, there is no need to read the incoming payload from the transport. By the time the SMB READ response comes back, the data has already been transferred and placed in the pages by RDMA.

Signed-off-by: Long Li <longli@microsoft.com>
---
 fs/cifs/file.c    |  5 +++++
 fs/cifs/smb2pdu.c | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index dec70b3..41460a5 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -42,6 +42,7 @@
 #include "cifs_debug.h"
 #include "cifs_fs_sb.h"
 #include "fscache.h"
+#include "smbdirect.h"
 
 
 static inline int cifs_convert_flags(unsigned int flags)
@@ -3037,6 +3038,8 @@ uncached_fill_pages(struct TCP_Server_Info *server,
 		}
 		if (iter)
 			result = copy_page_from_iter(page, 0, n, iter);
+		else if (rdata->mr)
+			result = n;
 		else
 			result = cifs_read_page_from_socket(server, page, n);
 		if (result < 0)
@@ -3606,6 +3609,8 @@ readpages_fill_pages(struct TCP_Server_Info *server,
 
 		if (iter)
 			result = copy_page_from_iter(page, 0, n, iter);
+		else if (rdata->mr)
+			result = n;
 		else
 			result = cifs_read_page_from_socket(server, page, n);
 		if (result < 0)
diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index fbad987..1f08c75 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -2392,6 +2392,39 @@ smb2_new_read_req(void **buf, unsigned int *total_len,
 	req->Length = cpu_to_le32(io_parms->length);
 	req->Offset = cpu_to_le64(io_parms->offset);
 
+	/*
+	 * If we want to do a RDMA write, fill in and append
+	 * smbd_buffer_descriptor_v1 to the end of read request
+	 */
+	if (server->rdma && rdata &&
+		rdata->bytes > server->smbd_conn->rdma_readwrite_threshold) {
+
+		struct smbd_buffer_descriptor_v1 *v1;
+		bool need_invalidate =
+			io_parms->tcon->ses->server->dialect == SMB30_PROT_ID;
+
+		rdata->mr = smbd_register_mr(
+				server->smbd_conn, rdata->pages,
+				rdata->nr_pages, rdata->tailsz,
+				true, need_invalidate);
+		if (!rdata->mr)
+			return -ENOBUFS;
+
+		req->Channel = SMB2_CHANNEL_RDMA_V1_INVALIDATE;
+		if (need_invalidate)
+			req->Channel = SMB2_CHANNEL_RDMA_V1;
+		req->ReadChannelInfoOffset =
+			offsetof(struct smb2_read_plain_req, Buffer);
+		req->ReadChannelInfoLength =
+			sizeof(struct smbd_buffer_descriptor_v1);
+		v1 = (struct smbd_buffer_descriptor_v1 *) &req->Buffer[0];
+		v1->offset = rdata->mr->mr->iova;
+		v1->token = rdata->mr->mr->rkey;
+		v1->length = rdata->mr->mr->length;
+
+		*total_len += sizeof(*v1) - 1;
+	}
+
 	if (request_type & CHAINED_REQUEST) {
 		if (!(request_type & END_OF_CHAIN)) {
 			/* next 8-byte aligned request */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 53+ messages in thread
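
The decision made in smb2_new_read_req() in patch 17 can be sketched in userspace like this. Everything named `*_sketch`, the DIALECT_* and CHANNEL_* enumerators, and the helper functions are hypothetical; the 16-byte descriptor layout (8-byte offset, 4-byte token, 4-byte length, little-endian) is my reading of the buffer descriptor in [MS-SMBD] and should be treated as an assumption rather than as the patch's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative identifiers only; the real constants live in the CIFS headers. */
enum dialect_sketch { DIALECT_SMB30, DIALECT_SMB302, DIALECT_SMB311 };
enum channel_sketch { CHANNEL_NONE, CHANNEL_RDMA_V1, CHANNEL_RDMA_V1_INVALIDATE };

/*
 * Mirrors the patch: RDMA is used only when the read is larger than the
 * threshold; dialect 3.0 gets RDMA_V1, later dialects RDMA_V1_INVALIDATE.
 */
static enum channel_sketch pick_read_channel(bool rdma, uint32_t bytes,
					     uint32_t threshold,
					     enum dialect_sketch dialect)
{
	if (!rdma || bytes <= threshold)
		return CHANNEL_NONE;
	if (dialect == DIALECT_SMB30)
		return CHANNEL_RDMA_V1;
	return CHANNEL_RDMA_V1_INVALIDATE;
}

/* Store integers as little-endian bytes, independent of host byte order. */
static void put_le32(uint8_t *p, uint32_t v)
{
	for (int i = 0; i < 4; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}

static void put_le64(uint8_t *p, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}

/* Assumed wire layout: offset (8 bytes), token (4 bytes), length (4 bytes). */
static void pack_descriptor_v1(uint8_t out[16], uint64_t offset,
			       uint32_t token, uint32_t length)
{
	put_le64(out, offset);
	put_le32(out + 8, token);
	put_le32(out + 12, length);
}
```

Packing the descriptor byte by byte keeps the sketch correct on both little-endian and big-endian hosts; the threshold value passed in is up to the caller and is not taken from the patch.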

* [Patch v2 18/19] CIFS: SMBD: Deregister memory when finishing SMB READ
  2017-08-20 19:04 [Patch v2 00/19] CIFS: Implement SMBDirect Long Li
                   ` (14 preceding siblings ...)
  2017-08-20 19:04 ` [Patch v2 17/19] CIFS: SMBD: Implement SMB READ via RDMA write through memory registration Long Li
@ 2017-08-20 19:04 ` Long Li
  2017-08-20 19:04 ` [Patch v2 19/19] CIFS: SMBD: Add SMBDirect debug counters Long Li
       [not found] ` <1503255883-3041-1-git-send-email-longli-Lp/cVzEoVyZiJJESP9tAQJZ3qXmFLfmx@public.gmane.org>
  17 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-08-20 19:04 UTC (permalink / raw)
  To: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox
  Cc: Long Li

From: Long Li <longli@microsoft.com>

When an SMB READ finishes, deregister the memory regions if RDMA write was used for it. smbd_deregister_mr may need to perform local invalidation and sleep if the server does not use remote invalidation.

Signed-off-by: Long Li <longli@microsoft.com>
---
 fs/cifs/smb2pdu.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index 1f08c75..43a7b60 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -2504,6 +2504,16 @@ smb2_readv_callback(struct mid_q_entry *mid)
 			rdata->result = -EIO;
 	}
 
+	/*
+	 * If this rdata has a memory region registered, free the MR here.
+	 * MRs must be freed as soon as I/O finishes to prevent deadlock:
+	 * they are limited in number and are needed for future I/Os.
+	 */
+	if (rdata->mr) {
+		smbd_deregister_mr(rdata->mr);
+		rdata->mr = NULL;
+	}
+
 	if (rdata->result)
 		cifs_stats_fail_inc(tcon, SMB2_READ_HE);
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 53+ messages in thread
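
Why the MR has to be released promptly in patch 18 can be illustrated with a tiny fixed-size pool. This is a hypothetical sketch: `mr_pool_sketch`, `mr_get()`, `mr_put()` and the pool size of 2 are invented for the example, and unlike a real implementation the sketch reports exhaustion with -1 instead of waiting.

```c
#include <assert.h>
#include <stddef.h>

#define MR_POOL_SIZE_SKETCH 2	/* hypothetical; real pools are larger */

struct mr_pool_sketch {
	int in_use[MR_POOL_SIZE_SKETCH];
};

/* Take a free slot; return its index, or -1 when the pool is exhausted. */
static int mr_get(struct mr_pool_sketch *pool)
{
	for (int i = 0; i < MR_POOL_SIZE_SKETCH; i++) {
		if (!pool->in_use[i]) {
			pool->in_use[i] = 1;
			return i;
		}
	}
	return -1;
}

/*
 * Release a slot and clear the caller's handle, in the same spirit as
 * smbd_deregister_mr() followed by rdata->mr = NULL in the patch.
 */
static void mr_put(struct mr_pool_sketch *pool, int *slot)
{
	if (*slot >= 0) {
		pool->in_use[*slot] = 0;
		*slot = -1;
	}
}
```

If completed reads kept their slots, later requests would find the pool empty; an implementation that waits for a free MR instead of failing would then stall, which is the deadlock the patch comment warns about.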

* [Patch v2 19/19] CIFS: SMBD: Add SMBDirect debug counters
  2017-08-20 19:04 [Patch v2 00/19] CIFS: Implement SMBDirect Long Li
                   ` (15 preceding siblings ...)
  2017-08-20 19:04 ` [Patch v2 18/19] CIFS: SMBD: Deregister memory when finishing SMB READ Long Li
@ 2017-08-20 19:04 ` Long Li
       [not found] ` <1503255883-3041-1-git-send-email-longli-Lp/cVzEoVyZiJJESP9tAQJZ3qXmFLfmx@public.gmane.org>
  17 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-08-20 19:04 UTC (permalink / raw)
  To: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox
  Cc: Long Li

From: Long Li <longli@microsoft.com>

Export SMBDirect debug counters to /proc/fs/cifs/DebugData, for debugging and troubleshooting.

Signed-off-by: Long Li <longli@microsoft.com>
---
 fs/cifs/cifs_debug.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/fs/cifs/cifs_debug.c b/fs/cifs/cifs_debug.c
index ba0870d..ca73950 100644
--- a/fs/cifs/cifs_debug.c
+++ b/fs/cifs/cifs_debug.c
@@ -30,6 +30,7 @@
 #include "cifsproto.h"
 #include "cifs_debug.h"
 #include "cifsfs.h"
+#include "smbdirect.h"
 
 void
 cifs_dump_mem(char *label, void *data, int length)
@@ -152,6 +153,51 @@ static int cifs_debug_data_proc_show(struct seq_file *m, void *v)
 	list_for_each(tmp1, &cifs_tcp_ses_list) {
 		server = list_entry(tmp1, struct TCP_Server_Info,
 				    tcp_ses_list);
+
+		if (!server->rdma)
+			goto skip_rdma;
+
+		seq_printf(m, "\nSMBDirect transport status (all numbers in hex) protocol version: %x",
+			server->smbd_conn->protocol);
+		seq_printf(m, "\nConn receive_credit_max: %x send_credit_target: %x max_send_size: %x",
+			server->smbd_conn->receive_credit_max,
+			server->smbd_conn->send_credit_target,
+			server->smbd_conn->max_send_size);
+		seq_printf(m, "\nConn max_fragmented_recv_size: %x max_fragmented_send_size: %x max_receive_size: %x",
+			server->smbd_conn->max_fragmented_recv_size,
+			server->smbd_conn->max_fragmented_send_size,
+			server->smbd_conn->max_receive_size);
+		seq_printf(m, "\nConn keep_alive_interval: %x max_readwrite_size: %x rdma_readwrite_threshold: %x",
+			server->smbd_conn->keep_alive_interval,
+			server->smbd_conn->max_readwrite_size,
+			server->smbd_conn->rdma_readwrite_threshold);
+		seq_printf(m, "\nDebug count_receive_buffer: %x count_get_receive_buffer: %x count_put_receive_buffer: %x count_send_empty: %x",
+			server->smbd_conn->count_receive_buffer,
+			server->smbd_conn->count_get_receive_buffer,
+			server->smbd_conn->count_put_receive_buffer,
+			server->smbd_conn->count_send_empty);
+		seq_printf(m, "\nRead Queue count_reassembly_queue: %x count_enqueue_reassembly_queue: %x count_dequeue_reassembly_queue: %x fragment_reassembly_remaining: %x",
+			server->smbd_conn->count_reassembly_queue,
+			server->smbd_conn->count_enqueue_reassembly_queue,
+			server->smbd_conn->count_dequeue_reassembly_queue,
+			server->smbd_conn->fragment_reassembly_remaining);
+		seq_printf(m, "\nCurrent Credits send_credits: %x receive_credits: %x receive_credit_target: %x",
+			atomic_read(&server->smbd_conn->send_credits),
+			atomic_read(&server->smbd_conn->receive_credits),
+			server->smbd_conn->receive_credit_target);
+		seq_printf(m, "\nPending send_pending: %x send_payload_pending: %x recv_pending: %x read_pending: %x",
+			atomic_read(&server->smbd_conn->send_pending),
+			atomic_read(&server->smbd_conn->send_payload_pending),
+			atomic_read(&server->smbd_conn->recv_pending),
+			atomic_read(&server->smbd_conn->read_pending));
+		seq_printf(m, "\nMR responder_resources: %x max_frmr_depth: %x mr_type: %x",
+			server->smbd_conn->responder_resources,
+			server->smbd_conn->max_frmr_depth,
+			server->smbd_conn->mr_type);
+		seq_printf(m, "\nMR mr_ready_count: %x",
+			atomic_read(&server->smbd_conn->mr_ready_count));
+
+skip_rdma:
 		seq_printf(m, "\nNumber of credits: %d", server->credits);
 		i++;
 		list_for_each(tmp2, &server->smb_ses_list) {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 53+ messages in thread
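
When reading the output added by patch 19, note that every value is printed with %x, so the numbers are hexadecimal without a 0x prefix. A minimal userspace sketch of one such line follows; `struct credits_sketch` and `format_credits()` are hypothetical, and only the format string is copied from the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical subset of the connection counters shown in the patch. */
struct credits_sketch {
	unsigned int send_credits;
	unsigned int receive_credits;
	unsigned int receive_credit_target;
};

/*
 * Format one status line the way the patch does: a leading newline and
 * hexadecimal values without a 0x prefix. Returns snprintf()'s result.
 */
static int format_credits(char *buf, size_t len,
			  const struct credits_sketch *c)
{
	return snprintf(buf, len,
			"\nCurrent Credits send_credits: %x receive_credits: %x receive_credit_target: %x",
			c->send_credits, c->receive_credits,
			c->receive_credit_target);
}
```

With this format, a send_credits value of 255 appears as ff and a receive_credits value of 16 appears as 10, which is easy to misread as decimal.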

* Re: [Patch v2 01/19] CIFS: Add RDMA mount option
  2017-08-20 19:04     ` Long Li
@ 2017-08-21  4:36         ` Leon Romanovsky
  -1 siblings, 0 replies; 53+ messages in thread
From: Leon Romanovsky @ 2017-08-21  4:36 UTC (permalink / raw)
  To: Long Li
  Cc: Steve French, linux-cifs-u79uwXL29TY76Z2rM5mHXA,
	samba-technical-w/Ol4Ecudpl8XjKLYN78aQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig, Tom Talpey,
	Matthew Wilcox, Long Li

On Sun, Aug 20, 2017 at 12:04:25PM -0700, Long Li wrote:
> From: Long Li <longli-0li6OtcxBFHby3iVrkZq2A@public.gmane.org>
>
> Add "rdma" to CIFS mount option, which tells CIFS this is for connecting to a SMB server over SMBDirect. Add checks to validate this feature is only used on SMB 3.X dialects.
>
> To connect to SMBDirect, use "mount.cifs -o rdma,vers=3.x".
>
> Signed-off-by: Long Li <longli-0li6OtcxBFHby3iVrkZq2A@public.gmane.org>
> ---
>  fs/cifs/cifs_debug.c |  2 ++
>  fs/cifs/cifsfs.c     |  2 ++
>  fs/cifs/cifsglob.h   |  3 +++
>  fs/cifs/connect.c    | 25 ++++++++++++++++++++++++-
>  4 files changed, 31 insertions(+), 1 deletion(-)
>
> diff --git a/fs/cifs/cifs_debug.c b/fs/cifs/cifs_debug.c
> index 9727e1d..ba0870d 100644
> --- a/fs/cifs/cifs_debug.c
> +++ b/fs/cifs/cifs_debug.c
> @@ -171,6 +171,8 @@ static int cifs_debug_data_proc_show(struct seq_file *m, void *v)
>  				ses->ses_count, ses->serverOS, ses->serverNOS,
>  				ses->capabilities, ses->status);
>  			}
> +			if (server->rdma)
> +				seq_printf(m, "RDMA\n\t");
>  			seq_printf(m, "TCP status: %d\n\tLocal Users To "
>  				   "Server: %d SecMode: 0x%x Req On Wire: %d",
>  				   server->tcpStatus, server->srv_count,
> diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> index fe0c8dc..a628800 100644
> --- a/fs/cifs/cifsfs.c
> +++ b/fs/cifs/cifsfs.c
> @@ -330,6 +330,8 @@ cifs_show_address(struct seq_file *s, struct TCP_Server_Info *server)
>  	default:
>  		seq_puts(s, "(unknown)");
>  	}
> +	if (server->rdma)
> +		seq_puts(s, ",rdma");
>  }
>
>  static void
> diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
> index 8289f95..703c2fb 100644
> --- a/fs/cifs/cifsglob.h
> +++ b/fs/cifs/cifsglob.h
> @@ -531,6 +531,7 @@ struct smb_vol {
>  	bool nopersistent:1;
>  	bool resilient:1; /* noresilient not required since not fored for CA */
>  	bool domainauto:1;
> +	bool rdma:1;
>  	unsigned int rsize;
>  	unsigned int wsize;
>  	bool sockopt_tcp_nodelay:1;
> @@ -649,6 +650,8 @@ struct TCP_Server_Info {
>  	bool	sec_kerberos;		/* supports plain Kerberos */
>  	bool	sec_mskerberos;		/* supports legacy MS Kerberos */
>  	bool	large_buf;		/* is current buffer large? */
> +	/* use SMBD connection instead of socket */
> +	bool	rdma;
>  	struct delayed_work	echo; /* echo ping workqueue job */
>  	char	*smallbuf;	/* pointer to current "small" buffer */
>  	char	*bigbuf;	/* pointer to current "big" buffer */
> diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
> index 2eeaac6..d5d0ecd 100644
> --- a/fs/cifs/connect.c
> +++ b/fs/cifs/connect.c
> @@ -94,7 +94,7 @@ enum {
>  	Opt_multiuser, Opt_sloppy, Opt_nosharesock,
>  	Opt_persistent, Opt_nopersistent,
>  	Opt_resilient, Opt_noresilient,
> -	Opt_domainauto,
> +	Opt_domainauto, Opt_rdma,
>
>  	/* Mount options which take numeric value */
>  	Opt_backupuid, Opt_backupgid, Opt_uid,
> @@ -185,6 +185,7 @@ static const match_table_t cifs_mount_option_tokens = {
>  	{ Opt_resilient, "resilienthandles"},
>  	{ Opt_noresilient, "noresilienthandles"},
>  	{ Opt_domainauto, "domainauto"},
> +	{ Opt_rdma, "rdma"},
>
>  	{ Opt_backupuid, "backupuid=%s" },
>  	{ Opt_backupgid, "backupgid=%s" },
> @@ -1541,6 +1542,9 @@ cifs_parse_mount_options(const char *mountdata, const char *devname,
>  		case Opt_domainauto:
>  			vol->domainauto = true;
>  			break;
> +		case Opt_rdma:
> +			vol->rdma = true;
> +			break;
>
>  		/* Numeric Values */
>  		case Opt_backupuid:
> @@ -1931,6 +1935,21 @@ cifs_parse_mount_options(const char *mountdata, const char *devname,
>  		goto cifs_parse_mount_err;
>  	}
>
> +	if (vol->rdma) {
> +		switch (vol->vals->protocol_id) {
> +		case SMB30_PROT_ID:
> +		case SMB302_PROT_ID:
> +		case SMB311_PROT_ID:
> +			break;

In the cover letter, you wrote that this option is relevant for 3.X versions,
so why do you list the versions explicitly instead of checking if (version > SMB30_PROT_ID) ...?

Thanks

^ permalink raw reply	[flat|nested] 53+ messages in thread
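
The two styles of dialect check discussed in this review can be compared in a small userspace sketch. The `*_SKETCH` macros and both functions are hypothetical names for this example; the numeric values are the SMB2 dialect revision numbers (2.1, 3.0, 3.0.2, 3.1.1). Note that a strict greater-than comparison against the 3.0 identifier would reject dialect 3.0 itself, so the range variant below uses greater-than-or-equal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Dialect revision numbers; the macro names are local to this sketch. */
#define SMB21_ID_SKETCH  0x0210
#define SMB30_ID_SKETCH  0x0300
#define SMB302_ID_SKETCH 0x0302
#define SMB311_ID_SKETCH 0x0311

/* Explicit list, as in the patch: only known 3.x dialects pass. */
static bool rdma_allowed_explicit(uint16_t protocol_id)
{
	switch (protocol_id) {
	case SMB30_ID_SKETCH:
	case SMB302_ID_SKETCH:
	case SMB311_ID_SKETCH:
		return true;
	default:
		return false;
	}
}

/* Range check: shorter, but also accepts identifiers not known today. */
static bool rdma_allowed_range(uint16_t protocol_id)
{
	return protocol_id >= SMB30_ID_SKETCH;
}
```

The range check needs no update when a new dialect is added, at the cost of admitting values that have never been tested with the transport; the explicit list has the opposite trade-off.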

* RE: [Patch v2 01/19] CIFS: Add RDMA mount option
  2017-08-21  4:36         ` Leon Romanovsky
  (?)
@ 2017-08-21 18:18         ` Long Li
  -1 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-08-21 18:18 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox



> -----Original Message-----
> From: Leon Romanovsky [mailto:leon@kernel.org]
> Sent: Sunday, August 20, 2017 9:36 PM
> To: Long Li <longli@microsoft.com>
> Cc: Steve French <sfrench@samba.org>; linux-cifs@vger.kernel.org;
> samba-technical@lists.samba.org; linux-kernel@vger.kernel.org;
> linux-rdma@vger.kernel.org; Christoph Hellwig <hch@infradead.org>;
> Tom Talpey <ttalpey@microsoft.com>; Matthew Wilcox <mawilcox@microsoft.com>;
> Long Li <longli@microsoft.com>
> Subject: Re: [Patch v2 01/19] CIFS: Add RDMA mount option
> 
> On Sun, Aug 20, 2017 at 12:04:25PM -0700, Long Li wrote:
> > From: Long Li <longli@microsoft.com>
> >
> > Add "rdma" to CIFS mount option, which tells CIFS this is for connecting to a SMB server over SMBDirect. Add checks to validate this feature is only used on SMB 3.X dialects.
> >
> > To connect to SMBDirect, use "mount.cifs -o rdma,vers=3.x".
> >
> > Signed-off-by: Long Li <longli@microsoft.com>
> > ---
> >  fs/cifs/cifs_debug.c |  2 ++
> >  fs/cifs/cifsfs.c     |  2 ++
> >  fs/cifs/cifsglob.h   |  3 +++
> >  fs/cifs/connect.c    | 25 ++++++++++++++++++++++++-
> >  4 files changed, 31 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/cifs/cifs_debug.c b/fs/cifs/cifs_debug.c
> > index 9727e1d..ba0870d 100644
> > --- a/fs/cifs/cifs_debug.c
> > +++ b/fs/cifs/cifs_debug.c
> > @@ -171,6 +171,8 @@ static int cifs_debug_data_proc_show(struct seq_file *m, void *v)
> >  				ses->ses_count, ses->serverOS, ses->serverNOS,
> >  				ses->capabilities, ses->status);
> >  			}
> > +			if (server->rdma)
> > +				seq_printf(m, "RDMA\n\t");
> >  			seq_printf(m, "TCP status: %d\n\tLocal Users To "
> >  				   "Server: %d SecMode: 0x%x Req On Wire: %d",
> >  				   server->tcpStatus, server->srv_count,
> > diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> > index fe0c8dc..a628800 100644
> > --- a/fs/cifs/cifsfs.c
> > +++ b/fs/cifs/cifsfs.c
> > @@ -330,6 +330,8 @@ cifs_show_address(struct seq_file *s, struct TCP_Server_Info *server)
> >  	default:
> >  		seq_puts(s, "(unknown)");
> >  	}
> > +	if (server->rdma)
> > +		seq_puts(s, ",rdma");
> >  }
> >
> >  static void
> > diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
> > index 8289f95..703c2fb 100644
> > --- a/fs/cifs/cifsglob.h
> > +++ b/fs/cifs/cifsglob.h
> > @@ -531,6 +531,7 @@ struct smb_vol {
> >  	bool nopersistent:1;
> >  	bool resilient:1; /* noresilient not required since not fored for CA */
> >  	bool domainauto:1;
> > +	bool rdma:1;
> >  	unsigned int rsize;
> >  	unsigned int wsize;
> >  	bool sockopt_tcp_nodelay:1;
> > @@ -649,6 +650,8 @@ struct TCP_Server_Info {
> >  	bool	sec_kerberos;		/* supports plain Kerberos */
> >  	bool	sec_mskerberos;		/* supports legacy MS Kerberos */
> >  	bool	large_buf;		/* is current buffer large? */
> > +	/* use SMBD connection instead of socket */
> > +	bool	rdma;
> >  	struct delayed_work	echo; /* echo ping workqueue job */
> >  	char	*smallbuf;	/* pointer to current "small" buffer */
> >  	char	*bigbuf;	/* pointer to current "big" buffer */
> > diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
> > index 2eeaac6..d5d0ecd 100644
> > --- a/fs/cifs/connect.c
> > +++ b/fs/cifs/connect.c
> > @@ -94,7 +94,7 @@ enum {
> >  	Opt_multiuser, Opt_sloppy, Opt_nosharesock,
> >  	Opt_persistent, Opt_nopersistent,
> >  	Opt_resilient, Opt_noresilient,
> > -	Opt_domainauto,
> > +	Opt_domainauto, Opt_rdma,
> >
> >  	/* Mount options which take numeric value */
> >  	Opt_backupuid, Opt_backupgid, Opt_uid,
> > @@ -185,6 +185,7 @@ static const match_table_t cifs_mount_option_tokens = {
> >  	{ Opt_resilient, "resilienthandles"},
> >  	{ Opt_noresilient, "noresilienthandles"},
> >  	{ Opt_domainauto, "domainauto"},
> > +	{ Opt_rdma, "rdma"},
> >
> >  	{ Opt_backupuid, "backupuid=%s" },
> >  	{ Opt_backupgid, "backupgid=%s" },
> > @@ -1541,6 +1542,9 @@ cifs_parse_mount_options(const char *mountdata, const char *devname,
> >  		case Opt_domainauto:
> >  			vol->domainauto = true;
> >  			break;
> > +		case Opt_rdma:
> > +			vol->rdma = true;
> > +			break;
> >
> >  		/* Numeric Values */
> >  		case Opt_backupuid:
> > @@ -1931,6 +1935,21 @@ cifs_parse_mount_options(const char *mountdata, const char *devname,
> >  		goto cifs_parse_mount_err;
> >  	}
> >
> > +	if (vol->rdma) {
> > +		switch (vol->vals->protocol_id) {
> > +		case SMB30_PROT_ID:
> > +		case SMB302_PROT_ID:
> > +		case SMB311_PROT_ID:
> > +			break;
> 
> In the cover letter, you wrote that this option is relevant for 3.X versions,
> so why do you list the versions explicitly instead of checking
> if (version > SMB30_PROT_ID) ...?

Thanks for pointing it out. I will fix this.

> 
> Thanks

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [Patch v2 00/19] CIFS: Implement SMBDirect
       [not found] ` <1503255883-3041-1-git-send-email-longli-Lp/cVzEoVyZiJJESP9tAQJZ3qXmFLfmx@public.gmane.org>
  2017-08-20 19:04     ` Long Li
@ 2017-08-21 19:15     ` Steve Wise
  2017-08-21 19:15     ` Steve Wise
  2017-08-29 18:20     ` Roland Dreier
  3 siblings, 0 replies; 53+ messages in thread
From: Steve Wise @ 2017-08-21 19:15 UTC (permalink / raw)
  To: 'Long Li', 'Steve French',
	linux-cifs-u79uwXL29TY76Z2rM5mHXA,
	samba-technical-w/Ol4Ecudpl8XjKLYN78aQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Christoph Hellwig',
	'Tom Talpey', 'Matthew Wilcox'
  Cc: 'Long Li'

> 
> From: Long Li <longli-0li6OtcxBFHby3iVrkZq2A@public.gmane.org>
> 
> Starting with SMB2 dialect 3.0, Microsoft introduced SMBDirect transport
> protocol for transferring upper layer (SMB2) payload over RDMA via
> Infiniband, RoCE or iWARP. The prococol is published in [MS-SMBD]
> (https://msdn.microsoft.com/en-us/library/hh536346.aspx).
> 
> The patch v2 added RDMA read/write via memory registration, and addressed
> feedbacks on v1.
> 

Hey Long,

What testing have you done with this on the various rdma transports?  Does it
work over IB, RoCE, and iWARP providers?

Thanks,

Steve.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [Patch v2 00/19] CIFS: Implement SMBDirect
  2017-08-21 19:15     ` Steve Wise
  (?)
  (?)
@ 2017-08-21 19:50     ` Long Li
  2017-08-21 19:56         ` Steve Wise
  -1 siblings, 1 reply; 53+ messages in thread
From: Long Li @ 2017-08-21 19:50 UTC (permalink / raw)
  To: Steve Wise, 'Steve French',
	linux-cifs, samba-technical, linux-kernel, linux-rdma,
	'Christoph Hellwig',
	Tom Talpey, Matthew Wilcox



> -----Original Message-----
> From: Steve Wise [mailto:swise@opengridcomputing.com]
> Sent: Monday, August 21, 2017 12:15 PM
> To: Long Li <longli@microsoft.com>; 'Steve French' <sfrench@samba.org>;
> linux-cifs@vger.kernel.org; samba-technical@lists.samba.org;
> linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org;
> 'Christoph Hellwig' <hch@infradead.org>; Tom Talpey <ttalpey@microsoft.com>;
> Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Long Li <longli@microsoft.com>
> Subject: RE: [Patch v2 00/19] CIFS: Implement SMBDirect
> 
> >
> > From: Long Li <longli@microsoft.com>
> >
> > Starting with SMB2 dialect 3.0, Microsoft introduced SMBDirect transport
> > protocol for transferring upper layer (SMB2) payload over RDMA via
> > Infiniband, RoCE or iWARP. The prococol is published in [MS-SMBD]
> > (https://msdn.microsoft.com/en-us/library/hh536346.aspx).
> >
> > The patch v2 added RDMA read/write via memory registration, and
> > addressed feedbacks on v1.
> >
> 
> Hey Long,
> 
> What testing have you done with this on the various rdma transports?  Does
> it work over IB, RoCE, and iWARP providers?

Hi Steve,

Currently all the tests have been done over Infiniband. We haven't tested on RoCE or iWARP yet, but plan to do so in the coming weeks.

Long

> 
> Thanks,
> 
> Steve.
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [Patch v2 00/19] CIFS: Implement SMBDirect
  2017-08-21 19:50     ` Long Li
@ 2017-08-21 19:56         ` Steve Wise
  0 siblings, 0 replies; 53+ messages in thread
From: Steve Wise @ 2017-08-21 19:56 UTC (permalink / raw)
  To: 'Long Li', 'Steve French',
	linux-cifs, samba-technical, linux-kernel, linux-rdma,
	'Christoph Hellwig', 'Tom Talpey',
	'Matthew Wilcox'

> >
> > Hey Long,
> >
> > What testing have you done with this on the various rdma transports?  Does
> > it work over IB, RoCE, and iWARP providers?
> 
> Hi Steve,
> 
> Currently all the tests have been done over Infiniband. We haven't tested on
> RoCE or iWARP yet, but plan to do so in the coming weeks.
> 
> Long

Ok, good.

Is this series available on github or somewhere so we can clone it and review it
as it is applied to the kernel src?

Thanks,

Steve.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [Patch v2 00/19] CIFS: Implement SMBDirect
  2017-08-21 19:56         ` Steve Wise
  (?)
@ 2017-08-21 20:23         ` Long Li
  2017-08-29 18:10           ` Long Li
  -1 siblings, 1 reply; 53+ messages in thread
From: Long Li @ 2017-08-21 20:23 UTC (permalink / raw)
  To: Steve Wise, 'Steve French',
	linux-cifs, samba-technical, linux-kernel, linux-rdma,
	'Christoph Hellwig',
	Tom Talpey, Matthew Wilcox

> > > Hey Long,
> > >
> > > What testing have you done with this on the various rdma transports?
> > > Does it work over IB, RoCE, and iWARP providers?
> >
> > Hi Steve,
> >
> > Currently all the tests have been done over Infiniband. We haven't
> > tested on RoCE or iWARP, but planned to do it in the following weeks.
> >
> > Long
> 
> Ok, good.
> 
> Is this series available on github or somewhere so we can clone it and review
> it as it is applied to the kernel src?

Unfortunately, the patches are not on GitHub yet. I will look into putting them there for review and will send an update soon.

Thanks for helping out!

> 
> Thanks,
> 
> Steve.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Patch v2 13/19] CIFS: SMBD: Use registered memory RDMA read for SMB write
  2017-08-20 19:04 ` [Patch v2 13/19] CIFS: SMBD: Use registered memory RDMA read for SMB write Long Li
@ 2017-08-23 13:52   ` Leon Romanovsky
       [not found]     ` <20170823135200.GP1724-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Leon Romanovsky @ 2017-08-23 13:52 UTC (permalink / raw)
  To: Long Li
  Cc: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox,
	Long Li

[-- Attachment #1: Type: text/plain, Size: 3111 bytes --]

On Sun, Aug 20, 2017 at 12:04:37PM -0700, Long Li wrote:
> From: Long Li <longli@microsoft.com>
>
> When sending I/O, if size is larger than rdma_readwrite_threshold we prepare to send SMB WRITE packet for a RDMA read via memory registration. The actual I/O is done out-of-the-band, so modify the relevant fields in the packet accordingly.
>
> Signed-off-by: Long Li <longli@microsoft.com>
> ---
>  fs/cifs/smb2pdu.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 44 insertions(+), 1 deletion(-)
>
> diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
> index 5cc5f6c..5581afd 100644
> --- a/fs/cifs/smb2pdu.c
> +++ b/fs/cifs/smb2pdu.c
> @@ -48,6 +48,7 @@
>  #include "smb2glob.h"
>  #include "cifspdu.h"
>  #include "cifs_spnego.h"
> +#include "smbdirect.h"
>
>  /*
>   *  The following table defines the expected "StructureSize" of SMB2 requests
> @@ -2716,6 +2717,41 @@ smb2_async_writev(struct cifs_writedata *wdata,
>  				offsetof(struct smb2_write_req, Buffer) - 4);
>  	req->RemainingBytes = 0;
>
> +	/*
> +	 * If we want to do a server RDMA read, fill in and append
> +	 * smbd_buffer_descriptor_v1 to the end of write request
> +	 */
> +	if (server->rdma && wdata->bytes >
> +		server->smbd_conn->rdma_readwrite_threshold) {
> +
> +		struct smbd_buffer_descriptor_v1 *v1;
> +		bool need_invalidate = server->dialect == SMB30_PROT_ID;
> +
> +		wdata->mr = smbd_register_mr(
> +				server->smbd_conn, wdata->pages,
> +				wdata->nr_pages, wdata->tailsz,
> +				false, need_invalidate);
> +		if (!wdata->mr) {
> +			rc = -ENOBUFS;
> +			goto async_writev_out;
> +		}
> +		req->Length = 0;
> +		req->DataOffset = 0;
> +		req->RemainingBytes =

Wow, we have CamelCase variables in the Linux kernel. It would help if you
started your patchset with a small cleanup that converts those variables from
CamelCase to normal names.

Thanks

> +			(wdata->nr_pages-1)*PAGE_SIZE + wdata->tailsz;
> +		req->Channel = SMB2_CHANNEL_RDMA_V1_INVALIDATE;
> +		if (need_invalidate)
> +			req->Channel = SMB2_CHANNEL_RDMA_V1;
> +		req->WriteChannelInfoOffset =
> +			offsetof(struct smb2_write_req, Buffer) - 4;
> +		req->WriteChannelInfoLength =
> +			sizeof(struct smbd_buffer_descriptor_v1);
> +		v1 = (struct smbd_buffer_descriptor_v1 *) &req->Buffer[0];
> +		v1->offset = wdata->mr->mr->iova;
> +		v1->token = wdata->mr->mr->rkey;
> +		v1->length = wdata->mr->mr->length;
> +	}
> +
>  	/* 4 for rfc1002 length field and 1 for Buffer */
>  	iov[0].iov_len = 4;
>  	iov[0].iov_base = req;
> @@ -2729,10 +2765,17 @@ smb2_async_writev(struct cifs_writedata *wdata,
>  	rqst.rq_pagesz = wdata->pagesz;
>  	rqst.rq_tailsz = wdata->tailsz;
>
> +	if (wdata->mr) {
> +		iov[1].iov_len += sizeof(struct smbd_buffer_descriptor_v1);
> +		rqst.rq_npages = 0;
> +	}
> +
>  	cifs_dbg(FYI, "async write at %llu %u bytes\n",
>  		 wdata->offset, wdata->bytes);
>
> -	req->Length = cpu_to_le32(wdata->bytes);
> +	/* For RDMA read, I/O size is in RemainingBytes not in Length */
> +	if (!wdata->mr)
> +		req->Length = cpu_to_le32(wdata->bytes);
>
>  	inc_rfc1001_len(&req->hdr, wdata->bytes - 1 /* Buffer */);
>
> --
> 2.7.4
>
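
For readers following the diff above, the threshold check and the RemainingBytes arithmetic can be sketched in isolation. This is only an illustration under stated assumptions, not the code from the patch: SKETCH_PAGE_SIZE, sketch_use_rdma(), sketch_rdma_io_size() and struct sketch_buffer_descriptor_v1 are hypothetical names, the page size is assumed to be 4 KiB, and the descriptor field widths are inferred from the assignments in the diff.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages; the kernel's PAGE_SIZE depends on the architecture. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Hypothetical stand-in for smbd_buffer_descriptor_v1: a 64-bit offset,
 * a 32-bit token and a 32-bit length, packed with no padding.
 */
struct sketch_buffer_descriptor_v1 {
	uint64_t offset;
	uint32_t token;
	uint32_t length;
} __attribute__((packed));

static_assert(sizeof(struct sketch_buffer_descriptor_v1) == 16,
	      "descriptor is expected to be 16 bytes");

/* Mirrors the check in the diff: RDMA must be on and the I/O above the threshold. */
int sketch_use_rdma(int rdma_enabled, uint32_t bytes, uint32_t threshold)
{
	return rdma_enabled && bytes > threshold;
}

/* Full pages except the last one, plus the bytes used in the last page. */
uint32_t sketch_rdma_io_size(uint32_t nr_pages, uint32_t tailsz)
{
	if (nr_pages == 0)
		return 0;	/* guard added for the sketch; not in the diff */
	return (nr_pages - 1) * SKETCH_PAGE_SIZE + tailsz;
}
```

With three pages and a 100-byte tail, sketch_rdma_io_size(3, 100) covers two full pages plus the tail. The comparison against the threshold is strict, so an I/O exactly at the threshold would still take the non-RDMA path in this sketch.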

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [Patch v2 13/19] CIFS: SMBD: Use registered memory RDMA read for SMB write
  2017-08-23 13:52   ` Leon Romanovsky
@ 2017-08-23 18:09         ` Long Li
  0 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-08-23 18:09 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox



> -----Original Message-----
> From: Leon Romanovsky [mailto:leon@kernel.org]
> Sent: Wednesday, August 23, 2017 6:52 AM
> To: Long Li <longli@microsoft.com>
> Cc: Steve French <sfrench@samba.org>; linux-cifs@vger.kernel.org;
> samba-technical@lists.samba.org; linux-kernel@vger.kernel.org;
> linux-rdma@vger.kernel.org; Christoph Hellwig <hch@infradead.org>;
> Tom Talpey <ttalpey@microsoft.com>; Matthew Wilcox <mawilcox@microsoft.com>;
> Long Li <longli@microsoft.com>
> Subject: Re: [Patch v2 13/19] CIFS: SMBD: Use registered memory RDMA
> read for SMB write
> 
> On Sun, Aug 20, 2017 at 12:04:37PM -0700, Long Li wrote:
> > From: Long Li <longli@microsoft.com>
> >
> > When sending I/O, if size is larger than rdma_readwrite_threshold we
> > prepare to send SMB WRITE packet for a RDMA read via memory registration.
> > The actual I/O is done out-of-the-band, so modify the relevant fields in the
> > packet accordingly.
> >
> > Signed-off-by: Long Li <longli@microsoft.com>
> > ---
> >  fs/cifs/smb2pdu.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 44 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
> > index 5cc5f6c..5581afd 100644
> > --- a/fs/cifs/smb2pdu.c
> > +++ b/fs/cifs/smb2pdu.c
> > @@ -48,6 +48,7 @@
> >  #include "smb2glob.h"
> >  #include "cifspdu.h"
> >  #include "cifs_spnego.h"
> > +#include "smbdirect.h"
> >
> >  /*
> >   *  The following table defines the expected "StructureSize" of SMB2 requests
> > @@ -2716,6 +2717,41 @@ smb2_async_writev(struct cifs_writedata *wdata,
> >  				offsetof(struct smb2_write_req, Buffer) - 4);
> >  	req->RemainingBytes = 0;
> >
> > +	/*
> > +	 * If we want to do a server RDMA read, fill in and append
> > +	 * smbd_buffer_descriptor_v1 to the end of write request
> > +	 */
> > +	if (server->rdma && wdata->bytes >
> > +		server->smbd_conn->rdma_readwrite_threshold) {
> > +
> > +		struct smbd_buffer_descriptor_v1 *v1;
> > +		bool need_invalidate = server->dialect == SMB30_PROT_ID;
> > +
> > +		wdata->mr = smbd_register_mr(
> > +				server->smbd_conn, wdata->pages,
> > +				wdata->nr_pages, wdata->tailsz,
> > +				false, need_invalidate);
> > +		if (!wdata->mr) {
> > +			rc = -ENOBUFS;
> > +			goto async_writev_out;
> > +		}
> > +		req->Length = 0;
> > +		req->DataOffset = 0;
> > +		req->RemainingBytes =
> 
> Wow, we have CamelCase variables in linux kernel. It will help if you start
> your patchset with small cleanup to convert those variables from CamelCase
> to normal names.

They are used throughout the upper-layer code for packet definitions, which were written a long time ago (mostly in fs/cifs/smb2pdu.h and fs/cifs/cifspdu.h).

I suggest we do a separate cleanup patch to rename them.

> 
> Thanks
> 
> > +			(wdata->nr_pages-1)*PAGE_SIZE + wdata->tailsz;
> > +		req->Channel = SMB2_CHANNEL_RDMA_V1_INVALIDATE;
> > +		if (need_invalidate)
> > +			req->Channel = SMB2_CHANNEL_RDMA_V1;
> > +		req->WriteChannelInfoOffset =
> > +			offsetof(struct smb2_write_req, Buffer) - 4;
> > +		req->WriteChannelInfoLength =
> > +			sizeof(struct smbd_buffer_descriptor_v1);
> > +		v1 = (struct smbd_buffer_descriptor_v1 *) &req->Buffer[0];
> > +		v1->offset = wdata->mr->mr->iova;
> > +		v1->token = wdata->mr->mr->rkey;
> > +		v1->length = wdata->mr->mr->length;
> > +	}
> > +
> >  	/* 4 for rfc1002 length field and 1 for Buffer */
> >  	iov[0].iov_len = 4;
> >  	iov[0].iov_base = req;
> > @@ -2729,10 +2765,17 @@ smb2_async_writev(struct cifs_writedata *wdata,
> >  	rqst.rq_pagesz = wdata->pagesz;
> >  	rqst.rq_tailsz = wdata->tailsz;
> >
> > +	if (wdata->mr) {
> > +		iov[1].iov_len += sizeof(struct smbd_buffer_descriptor_v1);
> > +		rqst.rq_npages = 0;
> > +	}
> > +
> >  	cifs_dbg(FYI, "async write at %llu %u bytes\n",
> >  		 wdata->offset, wdata->bytes);
> >
> > -	req->Length = cpu_to_le32(wdata->bytes);
> > +	/* For RDMA read, I/O size is in RemainingBytes not in Length */
> > +	if (!wdata->mr)
> > +		req->Length = cpu_to_le32(wdata->bytes);
> >
> >  	inc_rfc1001_len(&req->hdr, wdata->bytes - 1 /* Buffer */);
> >
> > --
> > 2.7.4
> >

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Patch v2 13/19] CIFS: SMBD: Use registered memory RDMA read for SMB write
  2017-08-23 18:09         ` Long Li
@ 2017-08-23 19:02             ` Leon Romanovsky
  -1 siblings, 0 replies; 53+ messages in thread
From: Leon Romanovsky @ 2017-08-23 19:02 UTC (permalink / raw)
  To: Long Li
  Cc: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox

[-- Attachment #1: Type: text/plain, Size: 4092 bytes --]

On Wed, Aug 23, 2017 at 06:09:11PM +0000, Long Li wrote:
>
>
> > -----Original Message-----
> > From: Leon Romanovsky [mailto:leon@kernel.org]
> > Sent: Wednesday, August 23, 2017 6:52 AM
> > To: Long Li <longli@microsoft.com>
> > Cc: Steve French <sfrench@samba.org>; linux-cifs@vger.kernel.org;
> > samba-technical@lists.samba.org; linux-kernel@vger.kernel.org;
> > linux-rdma@vger.kernel.org; Christoph Hellwig <hch@infradead.org>;
> > Tom Talpey <ttalpey@microsoft.com>; Matthew Wilcox <mawilcox@microsoft.com>;
> > Long Li <longli@microsoft.com>
> > Subject: Re: [Patch v2 13/19] CIFS: SMBD: Use registered memory RDMA
> > read for SMB write
> >
> > On Sun, Aug 20, 2017 at 12:04:37PM -0700, Long Li wrote:
> > > From: Long Li <longli@microsoft.com>
> > >
> > > When sending I/O, if size is larger than rdma_readwrite_threshold we
> > > prepare to send SMB WRITE packet for a RDMA read via memory registration.
> > > The actual I/O is done out-of-the-band, so modify the relevant fields in the
> > > packet accordingly.
> > >
> > > Signed-off-by: Long Li <longli@microsoft.com>
> > > ---
> > >  fs/cifs/smb2pdu.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
> > >  1 file changed, 44 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
> > > index 5cc5f6c..5581afd 100644
> > > --- a/fs/cifs/smb2pdu.c
> > > +++ b/fs/cifs/smb2pdu.c
> > > @@ -48,6 +48,7 @@
> > >  #include "smb2glob.h"
> > >  #include "cifspdu.h"
> > >  #include "cifs_spnego.h"
> > > +#include "smbdirect.h"
> > >
> > >  /*
> > >   *  The following table defines the expected "StructureSize" of SMB2 requests
> > > @@ -2716,6 +2717,41 @@ smb2_async_writev(struct cifs_writedata *wdata,
> > >  				offsetof(struct smb2_write_req, Buffer) - 4);
> > >  	req->RemainingBytes = 0;
> > >
> > > +	/*
> > > +	 * If we want to do a server RDMA read, fill in and append
> > > +	 * smbd_buffer_descriptor_v1 to the end of write request
> > > +	 */
> > > +	if (server->rdma && wdata->bytes >
> > > +		server->smbd_conn->rdma_readwrite_threshold) {
> > > +
> > > +		struct smbd_buffer_descriptor_v1 *v1;
> > > +		bool need_invalidate = server->dialect == SMB30_PROT_ID;
> > > +
> > > +		wdata->mr = smbd_register_mr(
> > > +				server->smbd_conn, wdata->pages,
> > > +				wdata->nr_pages, wdata->tailsz,
> > > +				false, need_invalidate);
> > > +		if (!wdata->mr) {
> > > +			rc = -ENOBUFS;
> > > +			goto async_writev_out;
> > > +		}
> > > +		req->Length = 0;
> > > +		req->DataOffset = 0;
> > > +		req->RemainingBytes =
> >
> > Wow, we have CamelCase variables in linux kernel. It will help if you start
> > your patchset with small cleanup to convert those variables from CamelCase
> > to normal names.
>
> They are used everywhere in the upper layer code for packet definitions, written a long time ago. (most in fs/cifs/smb2pdu.h and fs/cifs/cifspdu.h)

"Everywhere" is a bit of an overstatement in this case.
➜  linux-rdma git:(master) git grep RemainingBytes
fs/cifs/smb2pdu.c:              req->RemainingBytes = cpu_to_le32(remaining_bytes);
fs/cifs/smb2pdu.c:              req->RemainingBytes = 0;
fs/cifs/smb2pdu.c:      req->RemainingBytes = 0;
fs/cifs/smb2pdu.c:      req->RemainingBytes = 0;
fs/cifs/smb2pdu.h:      __le32 RemainingBytes;
fs/cifs/smb2pdu.h:      __le32 RemainingBytes;

One simple "sed -i" will replace all of them in one shot, and it doesn't
look like an impossible task.

>
> I suggest we do another cleanup patch to clean things up.

Yes, another cleanup patch is needed before your patches. You are adding
your code in 2017 and you are expected to follow present coding standards
like everyone else in the kernel.

Thanks

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [Patch v2 13/19] CIFS: SMBD: Use registered memory RDMA read for SMB write
  2017-08-23 19:02             ` Leon Romanovsky
@ 2017-08-23 19:10                 ` Long Li
  -1 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-08-23 19:10 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox



> -----Original Message-----
> From: Leon Romanovsky [mailto:leon@kernel.org]
> Sent: Wednesday, August 23, 2017 12:02 PM
> To: Long Li <longli@microsoft.com>
> Cc: Steve French <sfrench@samba.org>; linux-cifs@vger.kernel.org; samba-
> technical@lists.samba.org; linux-kernel@vger.kernel.org; linux-
> rdma@vger.kernel.org; Christoph Hellwig <hch@infradead.org>; Tom Talpey
> <ttalpey@microsoft.com>; Matthew Wilcox <mawilcox@microsoft.com>
> Subject: Re: [Patch v2 13/19] CIFS: SMBD: Use registered memory RDMA
> read for SMB write
> 
> On Wed, Aug 23, 2017 at 06:09:11PM +0000, Long Li wrote:
> >
> >
> > > -----Original Message-----
> > > From: Leon Romanovsky [mailto:leon@kernel.org]
> > > Sent: Wednesday, August 23, 2017 6:52 AM
> > > To: Long Li <longli@microsoft.com>
> > > Cc: Steve French <sfrench@samba.org>; linux-cifs@vger.kernel.org;
> > > samba-technical@lists.samba.org; linux-kernel@vger.kernel.org;
> > > linux-rdma@vger.kernel.org; Christoph Hellwig <hch@infradead.org>;
> > > Tom Talpey <ttalpey@microsoft.com>; Matthew Wilcox
> > > <mawilcox@microsoft.com>; Long Li <longli@microsoft.com>
> > > Subject: Re: [Patch v2 13/19] CIFS: SMBD: Use registered memory RDMA
> > > read for SMB write
> > >
> > > On Sun, Aug 20, 2017 at 12:04:37PM -0700, Long Li wrote:
> > > > From: Long Li <longli@microsoft.com>
> > > >
> > > > When sending I/O, if size is larger than rdma_readwrite_threshold we
> > > > prepare to send SMB WRITE packet for a RDMA read via memory registration.
> > > > The actual I/O is done out-of-the-band, so modify the relevant fields in
> > > > the packet accordingly.
> > > >
> > > > Signed-off-by: Long Li <longli@microsoft.com>
> > > > ---
> > > >  fs/cifs/smb2pdu.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
> > > >  1 file changed, 44 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
> > > > index 5cc5f6c..5581afd 100644
> > > > --- a/fs/cifs/smb2pdu.c
> > > > +++ b/fs/cifs/smb2pdu.c
> > > > @@ -48,6 +48,7 @@
> > > >  #include "smb2glob.h"
> > > >  #include "cifspdu.h"
> > > >  #include "cifs_spnego.h"
> > > > +#include "smbdirect.h"
> > > >
> > > >  /*
> > > >   *  The following table defines the expected "StructureSize" of SMB2 requests
> > > > @@ -2716,6 +2717,41 @@ smb2_async_writev(struct cifs_writedata *wdata,
> > > >  				offsetof(struct smb2_write_req, Buffer) - 4);
> > > >  	req->RemainingBytes = 0;
> > > >
> > > > +	/*
> > > > +	 * If we want to do a server RDMA read, fill in and append
> > > > +	 * smbd_buffer_descriptor_v1 to the end of write request
> > > > +	 */
> > > > +	if (server->rdma && wdata->bytes >
> > > > +		server->smbd_conn->rdma_readwrite_threshold) {
> > > > +
> > > > +		struct smbd_buffer_descriptor_v1 *v1;
> > > > +		bool need_invalidate = server->dialect == SMB30_PROT_ID;
> > > > +
> > > > +		wdata->mr = smbd_register_mr(
> > > > +				server->smbd_conn, wdata->pages,
> > > > +				wdata->nr_pages, wdata->tailsz,
> > > > +				false, need_invalidate);
> > > > +		if (!wdata->mr) {
> > > > +			rc = -ENOBUFS;
> > > > +			goto async_writev_out;
> > > > +		}
> > > > +		req->Length = 0;
> > > > +		req->DataOffset = 0;
> > > > +		req->RemainingBytes =
> > >
> > > Wow, we have CamelCase variables in linux kernel. It will help if
> > > you start your patchset with small cleanup to convert those
> > > variables from CamelCase to normal names.
> >
> > They are used everywhere in the upper layer code for packet
> > definitions, written a long time ago. (most in fs/cifs/smb2pdu.h and
> > fs/cifs/cifspdu.h)
> 
> "everywhere" is a little bit over estimated in this case.
> ➜  linux-rdma git:(master) git grep RemainingBytes
> fs/cifs/smb2pdu.c:              req->RemainingBytes = cpu_to_le32(remaining_bytes);
> fs/cifs/smb2pdu.c:              req->RemainingBytes = 0;
> fs/cifs/smb2pdu.c:      req->RemainingBytes = 0;
> fs/cifs/smb2pdu.c:      req->RemainingBytes = 0;
> fs/cifs/smb2pdu.h:      __le32 RemainingBytes;
> fs/cifs/smb2pdu.h:      __le32 RemainingBytes;
> 
> One simple "sed -i" will replace all them in one shot and it doesn't look like
> undoable task.

I mean that cifspdu.h and smb2pdu.h use CamelCase for all packet definitions. For example, here is another one in smb2pdu.h:
struct smb2_negotiate_rsp {
        struct smb2_hdr hdr;
        __le16 StructureSize;   /* Must be 65 */
        __le16 SecurityMode;
        __le16 DialectRevision;
        __le16 NegotiateContextCount;   /* Prior to SMB3.1.1 was Reserved & MBZ */
        __u8   ServerGUID[16];
        __le32 Capabilities;
        __le32 MaxTransactSize;
        __le32 MaxReadSize;
        __le32 MaxWriteSize;
        __le64 SystemTime;      /* MBZ */
        __le64 ServerStartTime;
        __le16 SecurityBufferOffset;
        __le16 SecurityBufferLength;
        __le32 NegotiateContextOffset;  /* Pre:SMB3.1.1 was reserved/ignored */
        __u8   Buffer[1];       /* variable length GSS security buffer */
} __packed;

We may want to change them all together to keep the naming consistent, but that is a lot of changes.
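
One point that may help when sizing such a cleanup: renaming fields in a packed packet definition should not change the wire layout, and that can be checked at compile time. The following is a minimal sketch under stated assumptions, not the real definitions from smb2pdu.h: both structs are hypothetical, trimmed-down stand-ins, sketch_remaining_bytes_offset() is an illustrative helper, and the packed attribute is the GCC/Clang spelling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down request using the existing CamelCase style. */
struct sketch_write_req_camel {
	uint16_t StructureSize;
	uint16_t DataOffset;
	uint32_t Length;
	uint64_t Offset;
	uint32_t Channel;
	uint32_t RemainingBytes;
} __attribute__((packed));

/* The same fields after a rename to lower-case names. */
struct sketch_write_req_snake {
	uint16_t structure_size;
	uint16_t data_offset;
	uint32_t length;
	uint64_t offset;
	uint32_t channel;
	uint32_t remaining_bytes;
} __attribute__((packed));

/* The build fails if the rename ever changes the size or a field offset. */
static_assert(sizeof(struct sketch_write_req_camel) ==
	      sizeof(struct sketch_write_req_snake),
	      "rename must not change the structure size");
static_assert(offsetof(struct sketch_write_req_camel, RemainingBytes) ==
	      offsetof(struct sketch_write_req_snake, remaining_bytes),
	      "rename must not move RemainingBytes");

/* Offset of the renamed field, exposed for a runtime check. */
size_t sketch_remaining_bytes_offset(void)
{
	return offsetof(struct sketch_write_req_snake, remaining_bytes);
}
```

A check of this kind only guards the layout. Every user of the old field names still has to be updated in the same patch, which is where most of the churn would be.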

> 
> >
> > I suggest we do another cleanup patch to clean things up.
> 
> Yes, another cleanup patch is needed before your patches. You are adding
> your code in 2017 and you are expected to follow present coding standards
> like everyone else in the kernel.
> 
> Thanks

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [Patch v2 13/19] CIFS: SMBD: Use registered memory RDMA read for SMB write
@ 2017-08-23 19:10                 ` Long Li
  0 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-08-23 19:10 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox



> -----Original Message-----
> From: Leon Romanovsky [mailto:leon@kernel.org]
> Sent: Wednesday, August 23, 2017 12:02 PM
> To: Long Li <longli@microsoft.com>
> Cc: Steve French <sfrench@samba.org>; linux-cifs@vger.kernel.org; samba-
> technical@lists.samba.org; linux-kernel@vger.kernel.org; linux-
> rdma@vger.kernel.org; Christoph Hellwig <hch@infradead.org>; Tom Talpey
> <ttalpey@microsoft.com>; Matthew Wilcox <mawilcox@microsoft.com>
> Subject: Re: [Patch v2 13/19] CIFS: SMBD: Use registered memory RDMA
> read for SMB write
> 
> On Wed, Aug 23, 2017 at 06:09:11PM +0000, Long Li wrote:
> >
> >
> > > -----Original Message-----
> > > From: Leon Romanovsky [mailto:leon@kernel.org]
> > > Sent: Wednesday, August 23, 2017 6:52 AM
> > > To: Long Li <longli@microsoft.com>
> > > Cc: Steve French <sfrench@samba.org>; linux-cifs@vger.kernel.org;
> > > samba- technical@lists.samba.org; linux-kernel@vger.kernel.org;
> > > linux- rdma@vger.kernel.org; Christoph Hellwig <hch@infradead.org>;
> > > Tom Talpey <ttalpey@microsoft.com>; Matthew Wilcox
> > > <mawilcox@microsoft.com>; Long Li <longli@microsoft.com>
> > > Subject: Re: [Patch v2 13/19] CIFS: SMBD: Use registered memory RDMA
> > > read for SMB write
> > >
> > > On Sun, Aug 20, 2017 at 12:04:37PM -0700, Long Li wrote:
> > > > From: Long Li <longli@microsoft.com>
> > > >
> > > > When sending I/O, if size is larger than rdma_readwrite_threshold
> > > > we
> > > prepare to send SMB WRITE packet for a RDMA read via memory
> registration.
> > > The actual I/O is done out-of-the-band, so modify the relevant
> > > fields in the packet accordingly.
> > > >
> > > > Signed-off-by: Long Li <longli@microsoft.com>
> > > > ---
> > > >  fs/cifs/smb2pdu.c | 45
> > > ++++++++++++++++++++++++++++++++++++++++++++-
> > > >  1 file changed, 44 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c index
> > > > 5cc5f6c..5581afd 100644
> > > > --- a/fs/cifs/smb2pdu.c
> > > > +++ b/fs/cifs/smb2pdu.c
> > > > @@ -48,6 +48,7 @@
> > > >  #include "smb2glob.h"
> > > >  #include "cifspdu.h"
> > > >  #include "cifs_spnego.h"
> > > > +#include "smbdirect.h"
> > > >
> > > >  /*
> > > >   *  The following table defines the expected "StructureSize" of
> > > > SMB2 requests @@ -2716,6 +2717,41 @@ smb2_async_writev(struct
> > > cifs_writedata *wdata,
> > > >  				offsetof(struct smb2_write_req, Buffer) - 4);
> > > >  	req->RemainingBytes = 0;
> > > >
> > > > +	/*
> > > > +	 * If we want to do a server RDMA read, fill in and append
> > > > +	 * smbd_buffer_descriptor_v1 to the end of write request
> > > > +	 */
> > > > +	if (server->rdma && wdata->bytes >
> > > > +		server->smbd_conn->rdma_readwrite_threshold) {
> > > > +
> > > > +		struct smbd_buffer_descriptor_v1 *v1;
> > > > +		bool need_invalidate = server->dialect == SMB30_PROT_ID;
> > > > +
> > > > +		wdata->mr = smbd_register_mr(
> > > > +				server->smbd_conn, wdata->pages,
> > > > +				wdata->nr_pages, wdata->tailsz,
> > > > +				false, need_invalidate);
> > > > +		if (!wdata->mr) {
> > > > +			rc = -ENOBUFS;
> > > > +			goto async_writev_out;
> > > > +		}
> > > > +		req->Length = 0;
> > > > +		req->DataOffset = 0;
> > > > +		req->RemainingBytes =
> > >
> > > Wow, we have CamelCase variables in linux kernel. It will help if
> > > you start your patchset with small cleanup to convert those
> > > variables from CamelCase to normal names.
> >
> > They are used everywhere in the upper layer code for packet
> > definitions, written a long time ago. (most in fs/cifs/smb2pdu.h and
> > fs/cifs/cifspdu.h)
> 
> "everywhere" is a little bit over estimated in this case.
> ➜  linux-rdma git:(master) git grep RemainingBytes
> fs/cifs/smb2pdu.c:              req->RemainingBytes =
> cpu_to_le32(remaining_bytes);
> fs/cifs/smb2pdu.c:              req->RemainingBytes = 0;
> fs/cifs/smb2pdu.c:      req->RemainingBytes = 0;
> fs/cifs/smb2pdu.c:      req->RemainingBytes = 0;
> fs/cifs/smb2pdu.h:      __le32 RemainingBytes;
> fs/cifs/smb2pdu.h:      __le32 RemainingBytes;
> 
> One simple "sed -i" will replace all them in one shot and it doesn't look like
> undoable task.

I mean that cifspdu.h and smb2pdu.h use CamelCase for all packet definitions. For example, here is another one in smb2pdu.h:
struct smb2_negotiate_rsp {
        struct smb2_hdr hdr;
        __le16 StructureSize;   /* Must be 65 */
        __le16 SecurityMode;
        __le16 DialectRevision;
        __le16 NegotiateContextCount;   /* Prior to SMB3.1.1 was Reserved & MBZ */
        __u8   ServerGUID[16];
        __le32 Capabilities;
        __le32 MaxTransactSize;
        __le32 MaxReadSize;
        __le32 MaxWriteSize;
        __le64 SystemTime;      /* MBZ */
        __le64 ServerStartTime;
        __le16 SecurityBufferOffset;
        __le16 SecurityBufferLength;
        __le32 NegotiateContextOffset;  /* Pre:SMB3.1.1 was reserved/ignored */
        __u8   Buffer[1];       /* variable length GSS security buffer */
} __packed;

We may want to change them all together to keep the naming consistent, but that's a lot of changes.

> 
> >
> > I suggest we do another cleanup patch to clean things up.
> 
> Yes, another cleanup patch is needed before your patches. You are adding
> your code in 2017 and you are expected to follow present coding standards
> like everyone else in the kernel.
> 
> Thanks

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Patch v2 13/19] CIFS: SMBD: Use registered memory RDMA read for SMB write
@ 2017-08-23 19:23                     ` Leon Romanovsky
  0 siblings, 0 replies; 53+ messages in thread
From: Leon Romanovsky @ 2017-08-23 19:23 UTC (permalink / raw)
  To: Long Li
  Cc: Steve French, linux-cifs, samba-technical, linux-kernel,
	linux-rdma, Christoph Hellwig, Tom Talpey, Matthew Wilcox

[-- Attachment #1: Type: text/plain, Size: 6135 bytes --]

On Wed, Aug 23, 2017 at 07:10:38PM +0000, Long Li wrote:
>
>
> > -----Original Message-----
> > From: Leon Romanovsky [mailto:leon@kernel.org]
> > Sent: Wednesday, August 23, 2017 12:02 PM
> > To: Long Li <longli@microsoft.com>
> > Cc: Steve French <sfrench@samba.org>; linux-cifs@vger.kernel.org; samba-
> > technical@lists.samba.org; linux-kernel@vger.kernel.org; linux-
> > rdma@vger.kernel.org; Christoph Hellwig <hch@infradead.org>; Tom Talpey
> > <ttalpey@microsoft.com>; Matthew Wilcox <mawilcox@microsoft.com>
> > Subject: Re: [Patch v2 13/19] CIFS: SMBD: Use registered memory RDMA
> > read for SMB write
> >
> > On Wed, Aug 23, 2017 at 06:09:11PM +0000, Long Li wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Leon Romanovsky [mailto:leon@kernel.org]
> > > > Sent: Wednesday, August 23, 2017 6:52 AM
> > > > To: Long Li <longli@microsoft.com>
> > > > Cc: Steve French <sfrench@samba.org>; linux-cifs@vger.kernel.org;
> > > > samba- technical@lists.samba.org; linux-kernel@vger.kernel.org;
> > > > linux- rdma@vger.kernel.org; Christoph Hellwig <hch@infradead.org>;
> > > > Tom Talpey <ttalpey@microsoft.com>; Matthew Wilcox
> > > > <mawilcox@microsoft.com>; Long Li <longli@microsoft.com>
> > > > Subject: Re: [Patch v2 13/19] CIFS: SMBD: Use registered memory RDMA
> > > > read for SMB write
> > > >
> > > > On Sun, Aug 20, 2017 at 12:04:37PM -0700, Long Li wrote:
> > > > > From: Long Li <longli@microsoft.com>
> > > > >
> > > > > When sending I/O, if size is larger than rdma_readwrite_threshold
> > > > > we
> > > > prepare to send SMB WRITE packet for a RDMA read via memory
> > registration.
> > > > The actual I/O is done out-of-the-band, so modify the relevant
> > > > fields in the packet accordingly.
> > > > >
> > > > > Signed-off-by: Long Li <longli@microsoft.com>
> > > > > ---
> > > > >  fs/cifs/smb2pdu.c | 45
> > > > ++++++++++++++++++++++++++++++++++++++++++++-
> > > > >  1 file changed, 44 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c index
> > > > > 5cc5f6c..5581afd 100644
> > > > > --- a/fs/cifs/smb2pdu.c
> > > > > +++ b/fs/cifs/smb2pdu.c
> > > > > @@ -48,6 +48,7 @@
> > > > >  #include "smb2glob.h"
> > > > >  #include "cifspdu.h"
> > > > >  #include "cifs_spnego.h"
> > > > > +#include "smbdirect.h"
> > > > >
> > > > >  /*
> > > > >   *  The following table defines the expected "StructureSize" of
> > > > > SMB2 requests @@ -2716,6 +2717,41 @@ smb2_async_writev(struct
> > > > cifs_writedata *wdata,
> > > > >  				offsetof(struct smb2_write_req, Buffer) - 4);
> > > > >  	req->RemainingBytes = 0;
> > > > >
> > > > > +	/*
> > > > > +	 * If we want to do a server RDMA read, fill in and append
> > > > > +	 * smbd_buffer_descriptor_v1 to the end of write request
> > > > > +	 */
> > > > > +	if (server->rdma && wdata->bytes >
> > > > > +		server->smbd_conn->rdma_readwrite_threshold) {
> > > > > +
> > > > > +		struct smbd_buffer_descriptor_v1 *v1;
> > > > > +		bool need_invalidate = server->dialect == SMB30_PROT_ID;
> > > > > +
> > > > > +		wdata->mr = smbd_register_mr(
> > > > > +				server->smbd_conn, wdata->pages,
> > > > > +				wdata->nr_pages, wdata->tailsz,
> > > > > +				false, need_invalidate);
> > > > > +		if (!wdata->mr) {
> > > > > +			rc = -ENOBUFS;
> > > > > +			goto async_writev_out;
> > > > > +		}
> > > > > +		req->Length = 0;
> > > > > +		req->DataOffset = 0;
> > > > > +		req->RemainingBytes =
> > > >
> > > > Wow, we have CamelCase variables in linux kernel. It will help if
> > > > you start your patchset with small cleanup to convert those
> > > > variables from CamelCase to normal names.
> > >
> > > They are used everywhere in the upper layer code for packet
> > > definitions, written a long time ago. (most in fs/cifs/smb2pdu.h and
> > > fs/cifs/cifspdu.h)
> >
> > "everywhere" is a little bit over estimated in this case.
> > ➜  linux-rdma git:(master) git grep RemainingBytes
> > fs/cifs/smb2pdu.c:              req->RemainingBytes =
> > cpu_to_le32(remaining_bytes);
> > fs/cifs/smb2pdu.c:              req->RemainingBytes = 0;
> > fs/cifs/smb2pdu.c:      req->RemainingBytes = 0;
> > fs/cifs/smb2pdu.c:      req->RemainingBytes = 0;
> > fs/cifs/smb2pdu.h:      __le32 RemainingBytes;
> > fs/cifs/smb2pdu.h:      __le32 RemainingBytes;
> >
> > One simple "sed -i" will replace all them in one shot and it doesn't look like
> > undoable task.
>
> I mean cifspdu.h and smb2pdu.h. use CamelCase for all packet definitions. For example another one in smb2pdu.h:
> struct smb2_negotiate_rsp {
>         struct smb2_hdr hdr;
>         __le16 StructureSize;   /* Must be 65 */
>         __le16 SecurityMode;
>         __le16 DialectRevision;
>         __le16 NegotiateContextCount;   /* Prior to SMB3.1.1 was Reserved & MBZ */
>         __u8   ServerGUID[16];
>         __le32 Capabilities;
>         __le32 MaxTransactSize;
>         __le32 MaxReadSize;
>         __le32 MaxWriteSize;
>         __le64 SystemTime;      /* MBZ */
>         __le64 ServerStartTime;
>         __le16 SecurityBufferOffset;
>         __le16 SecurityBufferLength;
>         __le32 NegotiateContextOffset;  /* Pre:SMB3.1.1 was reserved/ignored */
>         __u8   Buffer[1];       /* variable length GSS security buffer */
> } __packed;
>
> We may want to change them all together to keep naming consistent, that's a lot of changes.

Yes, but I'm not asking you to change the structures that you are not
touching or using; concentrate on the ones used in your series.

It would be great to have all these files in one coding style, but I agree
with you that it is not needed. I hope that your first patch, with a CC to
kernel-janitors, will bring in other people who will clean up the rest.

Thanks

>
> >
> > >
> > > I suggest we do another cleanup patch to clean things up.
> >
> > Yes, another cleanup patch is needed before your patches. You are adding
> > your code in 2017 and you are expected to follow present coding standards
> > like everyone else in the kernel.
> >
> > Thanks

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Patch v2 13/19] CIFS: SMBD: Use registered memory RDMA read for SMB write
  2017-08-23 19:10                 ` Long Li
  (?)
  (?)
@ 2017-08-23 19:39                 ` Steve French
  -1 siblings, 0 replies; 53+ messages in thread
From: Steve French @ 2017-08-23 19:39 UTC (permalink / raw)
  To: Long Li
  Cc: Leon Romanovsky, linux-cifs, linux-rdma, Matthew Wilcox,
	samba-technical, linux-kernel, Steve French

Note that CamelCase in the cifs.ko source is largely used for protocol
definitions, to match the official protocol documentation.  So we
expect CamelCase only where it is meant to express the **exact** name
of a field in the official specification of the wire protocol.  There
may be some legacy code (unrelated to the protocol wire format) that
still uses CamelCase; that could be cleaned up with other patches
outside this series, but it is lower priority.
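
To illustrate that convention, here is a minimal userspace sketch; it is
not code from cifs.ko.  The struct is a hypothetical, trimmed-down stand-in
for a wire-format request, and demo_write_req, demo_cpu_to_le32,
demo_le32_to_cpu and demo_fill_rdma_write are made-up names.  Plain uint32_t
stands in for the kernel's __le32.  Only the wire fields keep the
specification's CamelCase names; locals and functions stay in the usual
kernel style.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical wire structure (NOT the real struct smb2_write_req).
 * Field names copy the spelling used in the protocol specification.
 */
struct demo_write_req {
	uint32_t Length;		/* stored little-endian */
	uint32_t RemainingBytes;	/* stored little-endian */
};

/* Portable stand-in for the kernel's cpu_to_le32(). */
static uint32_t demo_cpu_to_le32(uint32_t v)
{
	uint8_t b[4] = {
		(uint8_t)(v & 0xff),
		(uint8_t)((v >> 8) & 0xff),
		(uint8_t)((v >> 16) & 0xff),
		(uint8_t)((v >> 24) & 0xff),
	};
	uint32_t out;

	memcpy(&out, b, sizeof(out));
	return out;
}

/* Portable stand-in for the kernel's le32_to_cpu(). */
static uint32_t demo_le32_to_cpu(uint32_t v)
{
	uint8_t b[4];

	memcpy(b, &v, sizeof(b));
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/*
 * Mirrors the shape of the quoted hunk: the payload travels out of band,
 * so Length is zeroed and RemainingBytes carries the size instead.
 */
static void demo_fill_rdma_write(struct demo_write_req *req,
				 uint32_t remaining_bytes)
{
	assert(req != NULL);
	req->Length = 0;
	req->RemainingBytes = demo_cpu_to_le32(remaining_bytes);
}
```

With this split, a reader can grep the specification for a field name such
as RemainingBytes and land on the matching struct member, while the
surrounding code still follows the normal naming rules.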

On Wed, Aug 23, 2017 at 2:10 PM, Long Li via samba-technical
<samba-technical@lists.samba.org> wrote:
>
>
>> -----Original Message-----
>> From: Leon Romanovsky [mailto:leon@kernel.org]
>> Sent: Wednesday, August 23, 2017 12:02 PM
>> To: Long Li <longli@microsoft.com>
>> Cc: Steve French <sfrench@samba.org>; linux-cifs@vger.kernel.org; samba-
>> technical@lists.samba.org; linux-kernel@vger.kernel.org; linux-
>> rdma@vger.kernel.org; Christoph Hellwig <hch@infradead.org>; Tom Talpey
>> <ttalpey@microsoft.com>; Matthew Wilcox <mawilcox@microsoft.com>
>> Subject: Re: [Patch v2 13/19] CIFS: SMBD: Use registered memory RDMA
>> read for SMB write
>>
>> On Wed, Aug 23, 2017 at 06:09:11PM +0000, Long Li wrote:
>> >
>> >
>> > > -----Original Message-----
>> > > From: Leon Romanovsky [mailto:leon@kernel.org]
>> > > Sent: Wednesday, August 23, 2017 6:52 AM
>> > > To: Long Li <longli@microsoft.com>
>> > > Cc: Steve French <sfrench@samba.org>; linux-cifs@vger.kernel.org;
>> > > samba- technical@lists.samba.org; linux-kernel@vger.kernel.org;
>> > > linux- rdma@vger.kernel.org; Christoph Hellwig <hch@infradead.org>;
>> > > Tom Talpey <ttalpey@microsoft.com>; Matthew Wilcox
>> > > <mawilcox@microsoft.com>; Long Li <longli@microsoft.com>
>> > > Subject: Re: [Patch v2 13/19] CIFS: SMBD: Use registered memory RDMA
>> > > read for SMB write
>> > >
>> > > On Sun, Aug 20, 2017 at 12:04:37PM -0700, Long Li wrote:
>> > > > From: Long Li <longli@microsoft.com>
>> > > >
>> > > > When sending I/O, if size is larger than rdma_readwrite_threshold
>> > > > we
>> > > prepare to send SMB WRITE packet for a RDMA read via memory
>> registration.
>> > > The actual I/O is done out-of-the-band, so modify the relevant
>> > > fields in the packet accordingly.
>> > > >
>> > > > Signed-off-by: Long Li <longli@microsoft.com>
>> > > > ---
>> > > >  fs/cifs/smb2pdu.c | 45
>> > > ++++++++++++++++++++++++++++++++++++++++++++-
>> > > >  1 file changed, 44 insertions(+), 1 deletion(-)
>> > > >
>> > > > diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c index
>> > > > 5cc5f6c..5581afd 100644
>> > > > --- a/fs/cifs/smb2pdu.c
>> > > > +++ b/fs/cifs/smb2pdu.c
>> > > > @@ -48,6 +48,7 @@
>> > > >  #include "smb2glob.h"
>> > > >  #include "cifspdu.h"
>> > > >  #include "cifs_spnego.h"
>> > > > +#include "smbdirect.h"
>> > > >
>> > > >  /*
>> > > >   *  The following table defines the expected "StructureSize" of
>> > > > SMB2 requests @@ -2716,6 +2717,41 @@ smb2_async_writev(struct
>> > > cifs_writedata *wdata,
>> > > >                                 offsetof(struct smb2_write_req, Buffer) - 4);
>> > > >         req->RemainingBytes = 0;
>> > > >
>> > > > +       /*
>> > > > +        * If we want to do a server RDMA read, fill in and append
>> > > > +        * smbd_buffer_descriptor_v1 to the end of write request
>> > > > +        */
>> > > > +       if (server->rdma && wdata->bytes >
>> > > > +               server->smbd_conn->rdma_readwrite_threshold) {
>> > > > +
>> > > > +               struct smbd_buffer_descriptor_v1 *v1;
>> > > > +               bool need_invalidate = server->dialect == SMB30_PROT_ID;
>> > > > +
>> > > > +               wdata->mr = smbd_register_mr(
>> > > > +                               server->smbd_conn, wdata->pages,
>> > > > +                               wdata->nr_pages, wdata->tailsz,
>> > > > +                               false, need_invalidate);
>> > > > +               if (!wdata->mr) {
>> > > > +                       rc = -ENOBUFS;
>> > > > +                       goto async_writev_out;
>> > > > +               }
>> > > > +               req->Length = 0;
>> > > > +               req->DataOffset = 0;
>> > > > +               req->RemainingBytes =
>> > >
>> > > Wow, we have CamelCase variables in linux kernel. It will help if
>> > > you start your patchset with small cleanup to convert those
>> > > variables from CamelCase to normal names.
>> >
>> > They are used everywhere in the upper layer code for packet
>> > definitions, written a long time ago. (most in fs/cifs/smb2pdu.h and
>> > fs/cifs/cifspdu.h)
>>
>> "everywhere" is a little bit over estimated in this case.
>> ➜  linux-rdma git:(master) git grep RemainingBytes
>> fs/cifs/smb2pdu.c:              req->RemainingBytes =
>> cpu_to_le32(remaining_bytes);
>> fs/cifs/smb2pdu.c:              req->RemainingBytes = 0;
>> fs/cifs/smb2pdu.c:      req->RemainingBytes = 0;
>> fs/cifs/smb2pdu.c:      req->RemainingBytes = 0;
>> fs/cifs/smb2pdu.h:      __le32 RemainingBytes;
>> fs/cifs/smb2pdu.h:      __le32 RemainingBytes;
>>
>> One simple "sed -i" will replace all them in one shot and it doesn't look like
>> undoable task.
>
> I mean cifspdu.h and smb2pdu.h. use CamelCase for all packet definitions. For example another one in smb2pdu.h:
> struct smb2_negotiate_rsp {
>         struct smb2_hdr hdr;
>         __le16 StructureSize;   /* Must be 65 */
>         __le16 SecurityMode;
>         __le16 DialectRevision;
>         __le16 NegotiateContextCount;   /* Prior to SMB3.1.1 was Reserved & MBZ */
>         __u8   ServerGUID[16];
>         __le32 Capabilities;
>         __le32 MaxTransactSize;
>         __le32 MaxReadSize;
>         __le32 MaxWriteSize;
>         __le64 SystemTime;      /* MBZ */
>         __le64 ServerStartTime;
>         __le16 SecurityBufferOffset;
>         __le16 SecurityBufferLength;
>         __le32 NegotiateContextOffset;  /* Pre:SMB3.1.1 was reserved/ignored */
>         __u8   Buffer[1];       /* variable length GSS security buffer */
> } __packed;
>
> We may want to change them all together to keep naming consistent, that's a lot of changes.
>
>>
>> >
>> > I suggest we do another cleanup patch to clean things up.
>>
>> Yes, another cleanup patch is needed before your patches. You are adding
>> your code in 2017 and you are expected to follow present coding standards
>> like everyone else in the kernel.
>>
>> Thanks



-- 
Thanks,

Steve

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [Patch v2 00/19] CIFS: Implement SMBDirect
  2017-08-21 20:23         ` Long Li
@ 2017-08-29 18:10           ` Long Li
       [not found]             ` <MWHPR21MB0190050F3699CDF6F52B51A2CE9F0-saRRjQKJ25M/hL2NnenhuM1VXTxX1y3OvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Long Li @ 2017-08-29 18:10 UTC (permalink / raw)
  To: Long Li, Steve Wise, 'Steve French',
	linux-cifs, linux-kernel, linux-rdma, 'Christoph Hellwig',
	Tom Talpey, Matthew Wilcox



> -----Original Message-----
> From: samba-technical [mailto:samba-technical-bounces@lists.samba.org]
> On Behalf Of Long Li via samba-technical
> Sent: Monday, August 21, 2017 1:24 PM
> To: Steve Wise <swise@opengridcomputing.com>; 'Steve French'
> <sfrench@samba.org>; linux-cifs@vger.kernel.org; samba-
> technical@lists.samba.org; linux-kernel@vger.kernel.org; linux-
> rdma@vger.kernel.org; 'Christoph Hellwig' <hch@infradead.org>; Tom
> Talpey <ttalpey@microsoft.com>; Matthew Wilcox
> <mawilcox@microsoft.com>
> Subject: RE: [Patch v2 00/19] CIFS: Implement SMBDirect
> 
> > > > Hey Long,
> > > >
> > > > What testing have you done with this on the various rdma transports?
> > > > Does it work over IB, RoCE, and iWARP providers?
> > >
> > > Hi Steve,
> > >
> > > Currently all the tests have been done over Infiniband. We haven't
> > > tested on
> > RoCE
> > > or iWARP, but planned to do it in the following weeks.
> > >
> > > Long
> >
> > Ok, good.
> >
> > Is this series available on github or somewhere so we can clone it and
> > review it as it is applied to the kernel src?

I have put patch v3 at the following location:
https://github.com/longlimsft/linux-next/tree/patch_v3

I will be sending it out soon. Please give it a try.

> 
> Unfortunately they are not on github. I will look into putting them there for
> review. Will update soon.
> 
> Thanks for helping out!
> 
> >
> > Thanks,
> >
> > Steve.
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Patch v2 00/19] CIFS: Implement SMBDirect
@ 2017-08-29 18:20     ` Roland Dreier
  0 siblings, 0 replies; 53+ messages in thread
From: Roland Dreier @ 2017-08-29 18:20 UTC (permalink / raw)
  To: Long Li
  Cc: Steve French, linux-cifs, samba-technical, LKML, linux-rdma,
	Christoph Hellwig, Tom Talpey, Matthew Wilcox, Long Li

> Starting with SMB2 dialect 3.0, Microsoft introduced SMBDirect transport protocol for transferring upper layer (SMB2) payload over RDMA via Infiniband, RoCE or iWARP. The protocol is published in [MS-SMBD] (https://msdn.microsoft.com/en-us/library/hh536346.aspx).

This is great to see.  Is there a Linux implementation of the server
side (in Samba?) so that the client can be tested without needing a
Windows server?

 - R.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [Patch v2 00/19] CIFS: Implement SMBDirect
@ 2017-08-29 19:31         ` Long Li
  0 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-08-29 19:31 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Steve French, linux-cifs, samba-technical, LKML, linux-rdma,
	Christoph Hellwig, Tom Talpey, Matthew Wilcox



> -----Original Message-----
> From: Roland Dreier [mailto:roland@purestorage.com]
> Sent: Tuesday, August 29, 2017 11:21 AM
> To: Long Li <longli@microsoft.com>
> Cc: Steve French <sfrench@samba.org>; linux-cifs@vger.kernel.org; samba-
> technical@lists.samba.org; LKML <linux-kernel@vger.kernel.org>; linux-
> rdma@vger.kernel.org; Christoph Hellwig <hch@infradead.org>; Tom Talpey
> <ttalpey@microsoft.com>; Matthew Wilcox <mawilcox@microsoft.com>;
> Long Li <longli@microsoft.com>
> Subject: Re: [Patch v2 00/19] CIFS: Implement SMBDirect
> 
> > Starting with SMB2 dialect 3.0, Microsoft introduced SMBDirect transport
> > protocol for transferring upper layer (SMB2) payload over RDMA via
> > Infiniband, RoCE or iWARP. The protocol is published in [MS-SMBD]
> > (https://msdn.microsoft.com/en-us/library/hh536346.aspx).
> 
> This is great to see.  Is there a Linux implementation of the server side (in
> Samba?) so that the client can be tested without needing a Windows server?

I'm not aware of a Linux implementation of the server side.

Currently it can be tested with Windows Server 2012, Server 2012 R2 and Server 2016.

> 
>  - R.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Patch v2 00/19] CIFS: Implement SMBDirect
  2017-08-29 19:31         ` Long Li
@ 2017-08-30  5:16             ` Stefan Metzmacher
  -1 siblings, 0 replies; 53+ messages in thread
From: Stefan Metzmacher @ 2017-08-30  5:16 UTC (permalink / raw)
  To: Long Li, Roland Dreier
  Cc: linux-cifs-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Matthew Wilcox,
	samba-technical-w/Ol4Ecudpl8XjKLYN78aQ, LKML, Steve French


[-- Attachment #1.1: Type: text/plain, Size: 823 bytes --]

Hi,

>> This is great to see.  Is there a Linux implementation of the server side (in
>> Samba?) so that the client can be tested without needing a Windows server?
> 
> I'm not aware of a Linux implementation on server side.

Here's a very early work in progress branch:
https://git.samba.org/?p=metze/samba/wip.git;a=shortlog;h=refs/heads/master3-rdma
It only explores how the protocol works, as it uses a userspace
smb-direct proxy (which works around the missing fork support of the
userspace libibverbs), which makes it really slow, but is required
in order to have the code tested in Samba's autobuild.

I think once this code lands in the kernel tree, we'll be able to
arrange a userspace API to it, in order to make a useful implementation,
so we can skip the userspace smb-direct proxy.

metze


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [Patch v2 00/19] CIFS: Implement SMBDirect
       [not found]             ` <MWHPR21MB0190050F3699CDF6F52B51A2CE9F0-saRRjQKJ25M/hL2NnenhuM1VXTxX1y3OvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2017-09-05 16:30                 ` Steve Wise
@ 2017-09-05 16:30                 ` Steve Wise
  0 siblings, 0 replies; 53+ messages in thread
From: Steve Wise @ 2017-09-05 16:30 UTC (permalink / raw)
  To: 'Long Li', 'Steve French',
	linux-cifs-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Christoph Hellwig',
	'Tom Talpey', 'Matthew Wilcox'
  Cc: 'Venkatesh Pottem'

> I have put the patch v3 in the following location:
> https://github.com/longlimsft/linux-next/tree/patch_v3
> 
> I will be sending it out soon. Please give it a try.
> 

Hey Long, how do I request a CIFS RDMA mount from the Linux client?  Is
there a mount.cifs option?  If so, where can I get the mount.cifs code that
enables this?  Or is there some other way? 

Thanks,

Steve.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [Patch v2 00/19] CIFS: Implement SMBDirect
  2017-09-05 16:30                 ` Steve Wise
  (?)
  (?)
@ 2017-09-05 17:42                 ` Long Li
  -1 siblings, 0 replies; 53+ messages in thread
From: Long Li @ 2017-09-05 17:42 UTC (permalink / raw)
  To: Steve Wise, 'Steve French',
	linux-cifs, linux-kernel, linux-rdma, 'Christoph Hellwig',
	Tom Talpey, Matthew Wilcox
  Cc: 'Venkatesh Pottem'

> Hey Long, how do I request a CIFS RDMA mount from the Linux client?  Is
> there a mount.cifs option?  If so, where can I get the mount.cifs code that
> enables this?  Or is there some other way?

You can pass "-o rdma" in the mount options to connect over RDMA.
For example: "mount.cifs -o rdma,vers=3.02"

The change to the mount option is in the patch series.
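
A fuller invocation might look like the sketch below. The server name,
share, mount point and username are placeholders (assumptions, not values
from this thread); only "rdma" and "vers=3.02" come from the example above.
The script only prints the command, so it can be reviewed before being run
as root on a kernel that carries this patch series.

```shell
#!/bin/sh
# Placeholder values: replace with a real SMB3 server, share and mount point.
SERVER="server.example.com"
SHARE="share"
MNT="/mnt/smbd"
SMB_USER="testuser"

# "rdma" selects the SMBDirect transport; "vers=3.02" selects the SMB dialect.
OPTS="rdma,vers=3.02,username=${SMB_USER}"

# Build the command line and print it instead of executing it.
CMD="mount.cifs //${SERVER}/${SHARE} ${MNT} -o ${OPTS}"
echo "$CMD"
```

Running the printed command should attempt the mount over RDMA.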

> 
> Thanks,
> 
> Steve.

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2017-09-05 17:42 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-20 19:04 [Patch v2 00/19] CIFS: Implement SMBDirect Long Li
2017-08-20 19:04 ` [Patch v2 03/19] CIFS: SMBD: " Long Li
2017-08-20 19:04 ` [Patch v2 04/19] CIFS: SMBD: Add SMBDirect transport to SMB connection and Makefile Long Li
2017-08-20 19:04 ` [Patch v2 05/19] CIFS: SMBD: Connect to SMBDirect session Long Li
2017-08-20 19:04 ` [Patch v2 06/19] CIFS: SMBD: Reconnect " Long Li
2017-08-20 19:04 ` [Patch v2 07/19] CIFS: SMBD: Destroy SMBDirect session on shutdown or umount Long Li
2017-08-20 19:04 ` [Patch v2 08/19] CIFS: SMBD: Set SMBDirect maximum read or write size for I/O Long Li
2017-08-20 19:04 ` [Patch v2 09/19] CIFS: SMBD: Read data from SMBDirect Long Li
2017-08-20 19:04 ` [Patch v2 10/19] CIFS: SMBD: Send data through SMBDirect Long Li
2017-08-20 19:04 ` [Patch v2 11/19] CIFS: SMBD: Define memory registration for I/O data Long Li
2017-08-20 19:04 ` [Patch v2 12/19] CIFS: SMBD: Fix the definition for SMB2_CHANNEL_RDMA_V1_INVALIDATE Long Li
2017-08-20 19:04 ` [Patch v2 13/19] CIFS: SMBD: Use registered memory RDMA read for SMB write Long Li
2017-08-23 13:52   ` Leon Romanovsky
     [not found]     ` <20170823135200.GP1724-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-08-23 18:09       ` Long Li
2017-08-23 18:09         ` Long Li
     [not found]         ` <MWHPR21MB0190DBBDE3317D973FEF0E43CE850-saRRjQKJ25M/hL2NnenhuM1VXTxX1y3OvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-08-23 19:02           ` Leon Romanovsky
2017-08-23 19:02             ` Leon Romanovsky
     [not found]             ` <20170823190214.GY1724-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-08-23 19:10               ` Long Li
2017-08-23 19:10                 ` Long Li
     [not found]                 ` <MWHPR21MB0190DDCDB0B5F8FED3265A60CE850-saRRjQKJ25M/hL2NnenhuM1VXTxX1y3OvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-08-23 19:23                   ` Leon Romanovsky
2017-08-23 19:23                     ` Leon Romanovsky
2017-08-23 19:39                 ` Steve French
2017-08-20 19:04 ` [Patch v2 14/19] CIFS: SMBD: Deregister memory when finishing " Long Li
2017-08-20 19:04 ` [Patch v2 15/19] CIFS: SMBD: Add parameter rdata to smb2_new_read_req Long Li
2017-08-20 19:04 ` [Patch v2 16/19] CIFS: SMBD: Read correct returned data length for RDMA write (SMB READ) I/O Long Li
2017-08-20 19:04 ` [Patch v2 17/19] CIFS: SMBD: Implement SMB READ via RDMA write through memory registration Long Li
2017-08-20 19:04 ` [Patch v2 18/19] CIFS: SMBD: Deregister memory when finishing SMB READ Long Li
2017-08-20 19:04 ` [Patch v2 19/19] CIFS: SMBD: Add SMBDirect debug counters Long Li
     [not found] ` <1503255883-3041-1-git-send-email-longli-Lp/cVzEoVyZiJJESP9tAQJZ3qXmFLfmx@public.gmane.org>
2017-08-20 19:04   ` [Patch v2 01/19] CIFS: Add RDMA mount option Long Li
2017-08-20 19:04     ` Long Li
     [not found]     ` <1503255883-3041-2-git-send-email-longli-Lp/cVzEoVyZiJJESP9tAQJZ3qXmFLfmx@public.gmane.org>
2017-08-21  4:36       ` Leon Romanovsky
2017-08-21  4:36         ` Leon Romanovsky
2017-08-21 18:18         ` Long Li
2017-08-20 19:04   ` [Patch v2 02/19] CIFS: SMBD: Add SMBDirect protocol and transport constants Long Li
2017-08-20 19:04     ` Long Li
2017-08-21 19:15   ` [Patch v2 00/19] CIFS: Implement SMBDirect Steve Wise
2017-08-21 19:15     ` Steve Wise
2017-08-21 19:15     ` Steve Wise
2017-08-21 19:50     ` Long Li
2017-08-21 19:56       ` Steve Wise
2017-08-21 19:56         ` Steve Wise
2017-08-21 20:23         ` Long Li
2017-08-29 18:10           ` Long Li
     [not found]             ` <MWHPR21MB0190050F3699CDF6F52B51A2CE9F0-saRRjQKJ25M/hL2NnenhuM1VXTxX1y3OvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-09-05 16:30               ` Steve Wise
2017-09-05 16:30                 ` Steve Wise
2017-09-05 16:30                 ` Steve Wise
2017-09-05 17:42                 ` Long Li
2017-08-29 18:20   ` Roland Dreier
2017-08-29 18:20     ` Roland Dreier
     [not found]     ` <CAL1RGDUGiOqjB8n0mJmF079jm7vXrdzE2rHdzoNOexvgHScjUQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-08-29 19:31       ` Long Li
2017-08-29 19:31         ` Long Li
     [not found]         ` <CY4PR21MB01821B704C2F64F72A8A92A9CE9F0-kUhI0YP1syo7ifcEnHlXec1VXTxX1y3OvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-08-30  5:16           ` Stefan Metzmacher
2017-08-30  5:16             ` Stefan Metzmacher
