* [PATCH 00/13] RFC: pnfs: LAYOUTGET/DEVINFO submission
@ 2010-09-02 18:00 Fred Isaman
  2010-09-02 18:00 ` [PATCH 01/13] NFSD: remove duplicate NFS4_STATEID_SIZE Fred Isaman
                   ` (12 more replies)
  0 siblings, 13 replies; 55+ messages in thread
From: Fred Isaman @ 2010-09-02 18:00 UTC (permalink / raw)
  To: linux-nfs

This is the start of code implementing pnfs, based on RFC 5661.  Since
sending the whole thing at once would be overwhelming, we are trying
to break it into bite-sized chunks.  This chunk implements the
mount/umount infrastructure, as well as sending the LAYOUTGET and
GETDEVICEINFO calls on I/O (but not actually using the information for
I/O).  Note that two major simplifications to the protocol will be made
throughout the initial submission process:  only the file layout
driver is considered, and only whole file layouts are requested.


These patches apply against Trond's nfs-for-2.6.37 branch.

patches 01-08 implement the mount/umount hooks
patches 09-13 implement LAYOUTGET and GETDEVICEINFO



* [PATCH 01/13] NFSD: remove duplicate NFS4_STATEID_SIZE
  2010-09-02 18:00 [PATCH 00/13] RFC: pnfs: LAYOUTGET/DEVINFO submission Fred Isaman
@ 2010-09-02 18:00 ` Fred Isaman
  2010-09-02 18:00 ` [PATCH 02/13] SUNRPC: define xdr_decode_opaque_fixed Fred Isaman
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Fred Isaman @ 2010-09-02 18:00 UTC (permalink / raw)
  To: linux-nfs

From: Andy Adamson <andros@netapp.com>

Already accepted by Bruce
---
 fs/nfsd/nfs4callback.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
index 988cbb3..014482c 100644
--- a/fs/nfsd/nfs4callback.c
+++ b/fs/nfsd/nfs4callback.c
@@ -41,7 +41,6 @@
 
 #define NFSPROC4_CB_NULL 0
 #define NFSPROC4_CB_COMPOUND 1
-#define NFS4_STATEID_SIZE 16
 
 /* Index of predefined Linux callback client operations */
 
-- 
1.7.2.1



* [PATCH 02/13] SUNRPC: define xdr_decode_opaque_fixed
  2010-09-02 18:00 [PATCH 00/13] RFC: pnfs: LAYOUTGET/DEVINFO submission Fred Isaman
  2010-09-02 18:00 ` [PATCH 01/13] NFSD: remove duplicate NFS4_STATEID_SIZE Fred Isaman
@ 2010-09-02 18:00 ` Fred Isaman
  2010-09-02 18:00 ` [PATCH 03/13] RFC: pnfsd, pnfs: protocol level pnfs constants Fred Isaman
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Fred Isaman @ 2010-09-02 18:00 UTC (permalink / raw)
  To: linux-nfs

From: Benny Halevy <bhalevy@panasas.com>

A helper for decoding a fixed length opaque value.
Returns a pointer to the next item in the xdr stream.
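
As a rough illustration (not part of the patch), the helper can be
exercised in userspace. The sketch below assumes XDR_QUADLEN rounds a
byte count up to whole 4-byte words; decode_opaque_fixed and
demo_next_word are hypothetical stand-ins that use uint32_t in place
of __be32.

```c
#include <stdint.h>
#include <string.h>

/* Number of 4-byte XDR words needed to hold len bytes (rounds up). */
#define XDR_QUADLEN(len) (((len) + 3) >> 2)

/*
 * Userspace stand-in for the helper: copy len bytes out of the stream
 * and return a pointer to the word that follows the padded opaque.
 */
uint32_t *decode_opaque_fixed(uint32_t *p, void *ptr, unsigned int len)
{
	memcpy(ptr, p, len);
	return p + XDR_QUADLEN(len);
}

/*
 * Demo: a 5-byte opaque occupies 8 bytes on the wire, so the value
 * that follows it starts two words into the stream.
 */
uint32_t demo_next_word(void)
{
	uint32_t stream[3];
	unsigned char opaque[5];
	uint32_t *next;

	memcpy(stream, "hello\0\0\0", 8);	/* 5 data bytes + 3 pad bytes */
	stream[2] = 42;				/* the item after the opaque */
	next = decode_opaque_fixed(stream, opaque, sizeof(opaque));
	return *next;
}
```

The caller is still responsible for making sure the stream holds at
least XDR_QUADLEN(len) words before calling the helper; the sketch
does no bounds checking, just like the inline it mirrors.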

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
 include/linux/sunrpc/xdr.h |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h
index 35cf2e8..23dc117 100644
--- a/include/linux/sunrpc/xdr.h
+++ b/include/linux/sunrpc/xdr.h
@@ -131,6 +131,13 @@ xdr_decode_hyper(__be32 *p, __u64 *valp)
 	return p + 2;
 }
 
+static inline __be32 *
+xdr_decode_opaque_fixed(__be32 *p, void *ptr, unsigned int len)
+{
+	memcpy(ptr, p, len);
+	return p + XDR_QUADLEN(len);
+}
+
 /*
  * Adjust kvec to reflect end of xdr'ed data (RPC client XDR)
  */
-- 
1.7.2.1



* [PATCH 03/13] RFC: pnfsd, pnfs: protocol level pnfs constants
  2010-09-02 18:00 [PATCH 00/13] RFC: pnfs: LAYOUTGET/DEVINFO submission Fred Isaman
  2010-09-02 18:00 ` [PATCH 01/13] NFSD: remove duplicate NFS4_STATEID_SIZE Fred Isaman
  2010-09-02 18:00 ` [PATCH 02/13] SUNRPC: define xdr_decode_opaque_fixed Fred Isaman
@ 2010-09-02 18:00 ` Fred Isaman
  2010-09-02 18:00 ` [PATCH 04/13] RFC: nfs: change stateid to be a union Fred Isaman
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Fred Isaman @ 2010-09-02 18:00 UTC (permalink / raw)
  To: linux-nfs

From: The pNFS Team <linux-nfs@vger.kernel.org>

Use only the layoutreturn constants for both returns and recalls.
(return_* works better for recall_type than the other way around.)
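
Besides the enums, the patch adds the NFL4_UFLG_* masks for the file
layout. A minimal sketch of how they could be applied, assuming (per
RFC 5661) that the low six bits of the nfl_util word carry flags and
the remaining bits carry the stripe unit size in bytes. The helper
names are hypothetical and do not appear in the patch.

```c
#include <stdint.h>

#define NFL4_UFLG_MASK			0x0000003F
#define NFL4_UFLG_DENSE			0x00000001
#define NFL4_UFLG_COMMIT_THRU_MDS	0x00000002
#define NFL4_UFLG_STRIPE_UNIT_SIZE_MASK	0xFFFFFFC0

/* Stripe unit size in bytes: everything above the flag bits. */
uint32_t nfl_util_stripe_unit(uint32_t nfl_util)
{
	return nfl_util & NFL4_UFLG_STRIPE_UNIT_SIZE_MASK;
}

/* Nonzero if the layout uses dense packing on the data servers. */
int nfl_util_is_dense(uint32_t nfl_util)
{
	return (nfl_util & NFL4_UFLG_DENSE) != 0;
}

/* Nonzero if COMMIT must go through the metadata server. */
int nfl_util_commit_thru_mds(uint32_t nfl_util)
{
	return (nfl_util & NFL4_UFLG_COMMIT_THRU_MDS) != 0;
}
```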

Signed-off-by: TBD - melding/reorganization of several patches
---
 include/linux/nfs4.h |   39 +++++++++++++++++++++++++++++++++++++++
 1 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
index 07e40c6..6798fc3 100644
--- a/include/linux/nfs4.h
+++ b/include/linux/nfs4.h
@@ -471,6 +471,8 @@ enum lock_type4 {
 #define FATTR4_WORD1_TIME_MODIFY        (1UL << 21)
 #define FATTR4_WORD1_TIME_MODIFY_SET    (1UL << 22)
 #define FATTR4_WORD1_MOUNTED_ON_FILEID  (1UL << 23)
+#define FATTR4_WORD1_FS_LAYOUT_TYPES    (1UL << 30)
+#define FATTR4_WORD2_LAYOUT_BLKSIZE     (1UL << 1)
 
 #define NFSPROC4_NULL 0
 #define NFSPROC4_COMPOUND 1
@@ -550,6 +552,43 @@ enum state_protect_how4 {
 	SP4_SSV		= 2
 };
 
+enum pnfs_layouttype {
+	LAYOUT_NFSV4_1_FILES  = 1,
+	LAYOUT_OSD2_OBJECTS = 2,
+	LAYOUT_BLOCK_VOLUME = 3,
+};
+
+/* used for both layout return and recall */
+enum pnfs_layoutreturn_type {
+	RETURN_FILE = 1,
+	RETURN_FSID = 2,
+	RETURN_ALL  = 3
+};
+
+enum pnfs_iomode {
+	IOMODE_READ = 1,
+	IOMODE_RW = 2,
+	IOMODE_ANY = 3,
+};
+
+enum pnfs_notify_deviceid_type4 {
+	NOTIFY_DEVICEID4_CHANGE = 1 << 1,
+	NOTIFY_DEVICEID4_DELETE = 1 << 2,
+};
+
+#define NFL4_UFLG_MASK			0x0000003F
+#define NFL4_UFLG_DENSE			0x00000001
+#define NFL4_UFLG_COMMIT_THRU_MDS	0x00000002
+#define NFL4_UFLG_STRIPE_UNIT_SIZE_MASK	0xFFFFFFC0
+
+/* Encoded in the loh_body field of type layouthint4 */
+enum filelayout_hint_care4 {
+	NFLH4_CARE_DENSE		= NFL4_UFLG_DENSE,
+	NFLH4_CARE_COMMIT_THRU_MDS	= NFL4_UFLG_COMMIT_THRU_MDS,
+	NFLH4_CARE_STRIPE_UNIT_SIZE	= 0x00000040,
+	NFLH4_CARE_STRIPE_COUNT		= 0x00000080
+};
+
 #endif
 #endif
 
-- 
1.7.2.1



* [PATCH 04/13] RFC: nfs: change stateid to be a union
  2010-09-02 18:00 [PATCH 00/13] RFC: pnfs: LAYOUTGET/DEVINFO submission Fred Isaman
                   ` (2 preceding siblings ...)
  2010-09-02 18:00 ` [PATCH 03/13] RFC: pnfsd, pnfs: protocol level pnfs constants Fred Isaman
@ 2010-09-02 18:00 ` Fred Isaman
  2010-09-02 18:00 ` [PATCH 05/13] RFC: nfs: ask for layouttypes during fsinfo call Fred Isaman
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Fred Isaman @ 2010-09-02 18:00 UTC (permalink / raw)
  To: linux-nfs

From: The pNFS Team <linux-nfs@vger.kernel.org>

In NFSv4.1 the stateid consists of the other and seqid fields. For layout
processing we need to numerically compare the seqid value of layout stateids.
To do so, introduce a union to nfs4_stateid to switch between opaque (16 bytes)
and a __be32 seqid followed by opaque (12 bytes).
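
A userspace sketch of the idea, for illustration only: the same 16
bytes are viewed either as raw data or as seqid plus other.
stateid_fill and stateid_is_newer are hypothetical helpers, uint32_t
stands in for __be32, and the comparison deliberately ignores seqid
wraparound.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>	/* htonl(), ntohl() */

#define NFS4_STATEID_SEQID_SIZE 4
#define NFS4_STATEID_OTHER_SIZE 12
#define NFS4_STATEID_SIZE (NFS4_STATEID_SEQID_SIZE + NFS4_STATEID_OTHER_SIZE)

struct nfs41_stateid {
	uint32_t seqid;				/* big-endian, as on the wire */
	char other[NFS4_STATEID_OTHER_SIZE];
} __attribute__ ((packed));

typedef union {
	char data[NFS4_STATEID_SIZE];		/* opaque 16-byte view */
	struct nfs41_stateid stateid;		/* structured view */
} nfs4_stateid;

/* Build a stateid from a host-order seqid and 12 bytes of "other". */
void stateid_fill(nfs4_stateid *s, uint32_t seqid, const char *other)
{
	s->stateid.seqid = htonl(seqid);
	memcpy(s->stateid.other, other, NFS4_STATEID_OTHER_SIZE);
}

/*
 * Nonzero if a and b name the same state and a carries the larger
 * seqid.  Wraparound is not handled in this sketch.
 */
int stateid_is_newer(const nfs4_stateid *a, const nfs4_stateid *b)
{
	if (memcmp(a->stateid.other, b->stateid.other,
		   NFS4_STATEID_OTHER_SIZE))
		return 0;
	return ntohl(a->stateid.seqid) > ntohl(b->stateid.seqid);
}
```

The seqid has to be converted to host order before it is compared;
comparing the raw big-endian words would give the wrong answer on
little-endian machines.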

Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
---
 fs/nfs/callback_proc.c |    8 ++++----
 include/linux/nfs4.h   |   15 +++++++++++++--
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/fs/nfs/callback_proc.c b/fs/nfs/callback_proc.c
index 930d10f..2950fca 100644
--- a/fs/nfs/callback_proc.c
+++ b/fs/nfs/callback_proc.c
@@ -118,11 +118,11 @@ int nfs41_validate_delegation_stateid(struct nfs_delegation *delegation, const n
 	if (delegation == NULL)
 		return 0;
 
-	/* seqid is 4-bytes long */
-	if (((u32 *) &stateid->data)[0] != 0)
+	if (stateid->stateid.seqid != 0)
 		return 0;
-	if (memcmp(&delegation->stateid.data[4], &stateid->data[4],
-		   sizeof(stateid->data)-4))
+	if (memcmp(&delegation->stateid.stateid.other,
+		   &stateid->stateid.other,
+		   NFS4_STATEID_OTHER_SIZE))
 		return 0;
 
 	return 1;
diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
index 6798fc3..2dde7c8 100644
--- a/include/linux/nfs4.h
+++ b/include/linux/nfs4.h
@@ -17,7 +17,9 @@
 
 #define NFS4_BITMAP_SIZE	2
 #define NFS4_VERIFIER_SIZE	8
-#define NFS4_STATEID_SIZE	16
+#define NFS4_STATEID_SEQID_SIZE 4
+#define NFS4_STATEID_OTHER_SIZE 12
+#define NFS4_STATEID_SIZE	(NFS4_STATEID_SEQID_SIZE + NFS4_STATEID_OTHER_SIZE)
 #define NFS4_FHSIZE		128
 #define NFS4_MAXPATHLEN		PATH_MAX
 #define NFS4_MAXNAMLEN		NAME_MAX
@@ -167,7 +169,16 @@ struct nfs4_acl {
 };
 
 typedef struct { char data[NFS4_VERIFIER_SIZE]; } nfs4_verifier;
-typedef struct { char data[NFS4_STATEID_SIZE]; } nfs4_stateid;
+
+struct nfs41_stateid {
+	__be32 seqid;
+	char other[NFS4_STATEID_OTHER_SIZE];
+} __attribute__ ((packed));
+
+typedef union {
+	char data[NFS4_STATEID_SIZE];
+	struct nfs41_stateid stateid;
+} nfs4_stateid;
 
 enum nfs_opnum4 {
 	OP_ACCESS = 3,
-- 
1.7.2.1



* [PATCH 05/13] RFC: nfs: ask for layouttypes during fsinfo call
  2010-09-02 18:00 [PATCH 00/13] RFC: pnfs: LAYOUTGET/DEVINFO submission Fred Isaman
                   ` (3 preceding siblings ...)
  2010-09-02 18:00 ` [PATCH 04/13] RFC: nfs: change stateid to be a union Fred Isaman
@ 2010-09-02 18:00 ` Fred Isaman
  2010-09-02 18:00 ` [PATCH 06/13] RFC: nfs: set layout driver Fred Isaman
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Fred Isaman @ 2010-09-02 18:00 UTC (permalink / raw)
  To: linux-nfs

From: The pNFS Team <linux-nfs@vger.kernel.org>

This information will be used to determine which layout driver,
if any, to use for subsequent IO on this filesystem.  Each driver
is assigned an integer id, with 0 reserved to indicate no driver.

The server can in theory return multiple ids.  However, our current
client implementation only notes the first entry and ignores the
rest.
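
The "note the first entry, ignore the rest" behaviour can be sketched
in userspace as below. This is an illustration under stated
assumptions, not the kernel decoder: decode_first_layout_type is a
hypothetical function that works on a plain array of big-endian words
rather than an xdr_stream, and it bounds-checks the count against the
buffer before reading any entry.

```c
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>	/* ntohl() */

/*
 * buf holds nwords big-endian 32-bit words: a count, then that many
 * layout type ids.  On success returns 0 and stores the first id in
 * *layouttype (0 when the list is empty, meaning "no pNFS").
 * Returns -1 if the buffer is too short for the advertised count.
 */
int decode_first_layout_type(const uint32_t *buf, size_t nwords,
			     uint32_t *layouttype)
{
	uint32_t num;

	if (nwords < 1)
		return -1;
	num = ntohl(buf[0]);
	if (num == 0) {
		*layouttype = 0;
		return 0;
	}
	if (num > nwords - 1)
		return -1;		/* count runs past the buffer */
	*layouttype = ntohl(buf[1]);	/* remaining ids are skipped */
	return 0;
}
```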

Signed-off-by: TBD - melding/reorganization of several patches
---
 fs/nfs/nfs4proc.c       |    2 +-
 fs/nfs/nfs4xdr.c        |   57 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/nfs_xdr.h |    1 +
 3 files changed, 59 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 089da5b..9e6b086 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -129,7 +129,7 @@ const u32 nfs4_fsinfo_bitmap[2] = { FATTR4_WORD0_MAXFILESIZE
 			| FATTR4_WORD0_MAXREAD
 			| FATTR4_WORD0_MAXWRITE
 			| FATTR4_WORD0_LEASE_TIME,
-			0
+			FATTR4_WORD1_FS_LAYOUT_TYPES
 };
 
 const u32 nfs4_fs_locations_bitmap[2] = {
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index 08ef912..60233ae 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -3868,6 +3868,60 @@ xdr_error:
 	return status;
 }
 
+/*
+ * Decode potentially multiple layout types. Currently we only support
+ * one layout driver per file system.
+ */
+static int decode_first_pnfs_layout_type(struct xdr_stream *xdr,
+					 uint32_t *layouttype)
+{
+	uint32_t *p;
+	int num;
+
+	p = xdr_inline_decode(xdr, 4);
+	if (unlikely(!p))
+		goto out_overflow;
+	num = be32_to_cpup(p);
+
+	/* pNFS is not supported by the underlying file system */
+	if (num == 0) {
+		*layouttype = 0;
+		return 0;
+	}
+	if (num > 1)
+		printk(KERN_INFO "%s: Warning: Multiple pNFS layout drivers "
+			"per filesystem not supported\n", __func__);
+
+	/* Decode and set first layout type, move xdr->p past unused types */
+	p = xdr_inline_decode(xdr, num * 4);
+	if (unlikely(!p))
+		goto out_overflow;
+	*layouttype = be32_to_cpup(p);
+	return 0;
+out_overflow:
+	print_overflow_msg(__func__, xdr);
+	return -EIO;
+}
+
+/*
+ * The type of file system exported.
+ * Note we must ensure that layouttype is set in any non-error case.
+ */
+static int decode_attr_pnfstype(struct xdr_stream *xdr, uint32_t *bitmap,
+				uint32_t *layouttype)
+{
+	int status = 0;
+
+	dprintk("%s: bitmap is %x\n", __func__, bitmap[1]);
+	if (unlikely(bitmap[1] & (FATTR4_WORD1_FS_LAYOUT_TYPES - 1U)))
+		return -EIO;
+	if (bitmap[1] & FATTR4_WORD1_FS_LAYOUT_TYPES) {
+		status = decode_first_pnfs_layout_type(xdr, layouttype);
+		bitmap[1] &= ~FATTR4_WORD1_FS_LAYOUT_TYPES;
+	} else
+		*layouttype = 0;
+	return status;
+}
 
 static int decode_fsinfo(struct xdr_stream *xdr, struct nfs_fsinfo *fsinfo)
 {
@@ -3894,6 +3948,9 @@ static int decode_fsinfo(struct xdr_stream *xdr, struct nfs_fsinfo *fsinfo)
 	if ((status = decode_attr_maxwrite(xdr, bitmap, &fsinfo->wtmax)) != 0)
 		goto xdr_error;
 	fsinfo->wtpref = fsinfo->wtmax;
+	status = decode_attr_pnfstype(xdr, bitmap, &fsinfo->layouttype);
+	if (status)
+		goto xdr_error;
 
 	status = verify_attr_len(xdr, savep, attrlen);
 xdr_error:
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index fc46192..8a2c228 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -113,6 +113,7 @@ struct nfs_fsinfo {
 	__u32			dtpref;	/* pref. readdir transfer size */
 	__u64			maxfilesize;
 	__u32			lease_time; /* in seconds */
+	__u32			layouttype; /* supported pnfs layout driver */
 };
 
 struct nfs_fsstat {
-- 
1.7.2.1



* [PATCH 06/13] RFC: nfs: set layout driver
  2010-09-02 18:00 [PATCH 00/13] RFC: pnfs: LAYOUTGET/DEVINFO submission Fred Isaman
                   ` (4 preceding siblings ...)
  2010-09-02 18:00 ` [PATCH 05/13] RFC: nfs: ask for layouttypes during fsinfo call Fred Isaman
@ 2010-09-02 18:00 ` Fred Isaman
  2010-09-02 18:00 ` [PATCH 07/13] RFC: pnfs: full mount/umount infrastructure Fred Isaman
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Fred Isaman @ 2010-09-02 18:00 UTC (permalink / raw)
  To: linux-nfs

From: The pNFS Team <linux-nfs@vger.kernel.org>

Put in the infrastructure that uses information returned from the
server at mount to select a layout driver module.

In this patch, a stub is used that always returns "no driver found".
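
For reference, the module name requested for a given layout type is
built from LAYOUT_NFSV4_1_MODULE_PREFIX and the numeric id. A minimal
userspace sketch of that formatting follows; layout_module_name is a
hypothetical helper, and snprintf stands in for the formatting that
request_module() does internally.

```c
#include <stddef.h>
#include <stdio.h>

#define LAYOUT_NFSV4_1_MODULE_PREFIX "nfs-layouttype4"

/*
 * Format the name that would be handed to request_module() for layout
 * type id.  Returns snprintf()'s result: the length the full name
 * needs, not counting the terminating NUL.
 */
int layout_module_name(char *buf, size_t len, unsigned int id)
{
	return snprintf(buf, len, "%s-%u", LAYOUT_NFSV4_1_MODULE_PREFIX, id);
}
```

For the file layout (id 1) this yields "nfs-layouttype4-1". Whether a
module actually answers to that name depends on its module aliases,
which this patch does not define.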

Signed-off-by: TBD - melding/reorganization of several patches
---
 fs/nfs/Makefile           |    1 +
 fs/nfs/client.c           |    4 ++
 fs/nfs/pnfs.c             |   78 +++++++++++++++++++++++++++++++++++++++++++++
 fs/nfs/pnfs.h             |   36 +++++++++++++++++++++
 include/linux/nfs_fs.h    |    1 +
 include/linux/nfs_fs_sb.h |    1 +
 6 files changed, 121 insertions(+), 0 deletions(-)
 create mode 100644 fs/nfs/pnfs.c
 create mode 100644 fs/nfs/pnfs.h

diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index da7fda6..bb9e773 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -15,5 +15,6 @@ nfs-$(CONFIG_NFS_V4)	+= nfs4proc.o nfs4xdr.o nfs4state.o nfs4renewd.o \
 			   delegation.o idmap.o \
 			   callback.o callback_xdr.o callback_proc.o \
 			   nfs4namespace.o
+nfs-$(CONFIG_NFS_V4_1)	+= pnfs.o
 nfs-$(CONFIG_SYSCTL) += sysctl.o
 nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-index.o
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 4e7df2a..eed1212 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -48,6 +48,7 @@
 #include "iostat.h"
 #include "internal.h"
 #include "fscache.h"
+#include "pnfs.h"
 
 #define NFSDBG_FACILITY		NFSDBG_CLIENT
 
@@ -898,6 +899,8 @@ static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fsinfo *
 	if (server->wsize > NFS_MAX_FILE_IO_SIZE)
 		server->wsize = NFS_MAX_FILE_IO_SIZE;
 	server->wpages = (server->wsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	set_pnfs_layoutdriver(server, fsinfo->layouttype);
+
 	server->wtmult = nfs_block_bits(fsinfo->wtmult, NULL);
 
 	server->dtsize = nfs_block_size(fsinfo->dtpref, NULL);
@@ -1017,6 +1020,7 @@ void nfs_free_server(struct nfs_server *server)
 {
 	dprintk("--> nfs_free_server()\n");
 
+	unset_pnfs_layoutdriver(server);
 	spin_lock(&nfs_client_lock);
 	list_del(&server->client_link);
 	list_del(&server->master_link);
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
new file mode 100644
index 0000000..2e5dba1
--- /dev/null
+++ b/fs/nfs/pnfs.c
@@ -0,0 +1,78 @@
+/*
+ *  pNFS functions to call and manage layout drivers.
+ *
+ *  Copyright (c) 2002 [year of first publication]
+ *  The Regents of the University of Michigan
+ *  All Rights Reserved
+ *
+ *  Dean Hildebrand <dhildebz@umich.edu>
+ *
+ *  Permission is granted to use, copy, create derivative works, and
+ *  redistribute this software and such derivative works for any purpose,
+ *  so long as the name of the University of Michigan is not used in
+ *  any advertising or publicity pertaining to the use or distribution
+ *  of this software without specific, written prior authorization. If
+ *  the above copyright notice or any other identification of the
+ *  University of Michigan is included in any copy of any portion of
+ *  this software, then the disclaimer below must also be included.
+ *
+ *  This software is provided as is, without representation or warranty
+ *  of any kind either express or implied, including without limitation
+ *  the implied warranties of merchantability, fitness for a particular
+ *  purpose, or noninfringement.  The Regents of the University of
+ *  Michigan shall not be liable for any damages, including special,
+ *  indirect, incidental, or consequential damages, with respect to any
+ *  claim arising out of or in connection with the use of the software,
+ *  even if it has been or is hereafter advised of the possibility of
+ *  such damages.
+ */
+
+#include <linux/nfs_fs.h>
+#include "pnfs.h"
+
+#define NFSDBG_FACILITY		NFSDBG_PNFS
+
+/* STUB that returns the equivalent of "no module found" */
+static struct pnfs_layoutdriver_type *
+find_pnfs_driver(u32 id) {
+	return NULL;
+}
+
+/* Unitialize a mountpoint in a layout driver */
+void
+unset_pnfs_layoutdriver(struct nfs_server *nfss)
+{
+	nfss->pnfs_curr_ld = NULL;
+}
+
+/*
+ * Try to set the server's pnfs module to the pnfs layout type specified by id.
+ * Currently only one pNFS layout driver per filesystem is supported.
+ *
+ * @id layout type. Zero (illegal layout type) indicates pNFS not in use.
+ */
+void
+set_pnfs_layoutdriver(struct nfs_server *server, u32 id)
+{
+	struct pnfs_layoutdriver_type *ld_type = NULL;
+
+	if (id == 0)
+		goto out_no_driver;
+	ld_type = find_pnfs_driver(id);
+	if (!ld_type) {
+		request_module("%s-%u", LAYOUT_NFSV4_1_MODULE_PREFIX, id);
+		ld_type = find_pnfs_driver(id);
+		if (!ld_type) {
+			dprintk("%s: No pNFS module found for %u.\n",
+				__func__, id);
+			goto out_no_driver;
+		}
+	}
+	server->pnfs_curr_ld = ld_type;
+	dprintk("%s: pNFS module for %u set\n", __func__, id);
+	return;
+
+out_no_driver:
+	dprintk("%s: Using NFSv4 I/O\n", __func__);
+	server->pnfs_curr_ld = NULL;
+}
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
new file mode 100644
index 0000000..3281fbf
--- /dev/null
+++ b/fs/nfs/pnfs.h
@@ -0,0 +1,36 @@
+/*
+ *  pNFS client data structures.
+ *
+ *  Copyright (c) 2002 The Regents of the University of Michigan.
+ *  All rights reserved.
+ *
+ *  Dean Hildebrand   <dhildebz@umich.edu>
+ */
+
+#ifndef FS_NFS_PNFS_H
+#define FS_NFS_PNFS_H
+
+#ifdef CONFIG_NFS_V4_1
+
+#define LAYOUT_NFSV4_1_MODULE_PREFIX "nfs-layouttype4"
+
+/* Per-layout driver specific registration structure */
+struct pnfs_layoutdriver_type {
+};
+
+void set_pnfs_layoutdriver(struct nfs_server *, u32 id);
+void unset_pnfs_layoutdriver(struct nfs_server *);
+
+#else  /* CONFIG_NFS_V4_1 */
+
+static inline void set_pnfs_layoutdriver(struct nfs_server *s, u32 id)
+{
+}
+
+static inline void unset_pnfs_layoutdriver(struct nfs_server *s)
+{
+}
+
+#endif /* CONFIG_NFS_V4_1 */
+
+#endif /* FS_NFS_PNFS_H */
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 508f8cf..042c2bd 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -613,6 +613,7 @@ extern void * nfs_root_data(void);
 #define NFSDBG_CLIENT		0x0200
 #define NFSDBG_MOUNT		0x0400
 #define NFSDBG_FSCACHE		0x0800
+#define NFSDBG_PNFS		0x1000
 #define NFSDBG_ALL		0xFFFF
 
 #ifdef __KERNEL__
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index c82ee7c..29a821d 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -144,6 +144,7 @@ struct nfs_server {
 	u32			acl_bitmask;	/* V4 bitmask representing the ACEs
 						   that are supported on this
 						   filesystem */
+	struct pnfs_layoutdriver_type  *pnfs_curr_ld; /* Active layout driver */
 #endif
 	void (*destroy)(struct nfs_server *);
 
-- 
1.7.2.1



* [PATCH 07/13] RFC: pnfs: full mount/umount infrastructure
  2010-09-02 18:00 [PATCH 00/13] RFC: pnfs: LAYOUTGET/DEVINFO submission Fred Isaman
                   ` (5 preceding siblings ...)
  2010-09-02 18:00 ` [PATCH 06/13] RFC: nfs: set layout driver Fred Isaman
@ 2010-09-02 18:00 ` Fred Isaman
  2010-09-10 19:23   ` Trond Myklebust
                     ` (2 more replies)
  2010-09-02 18:00 ` [PATCH 08/13] RFC: pnfs: filelayout: introduce minimal file layout driver Fred Isaman
                   ` (5 subsequent siblings)
  12 siblings, 3 replies; 55+ messages in thread
From: Fred Isaman @ 2010-09-02 18:00 UTC (permalink / raw)
  To: linux-nfs

From: The pNFS Team <linux-nfs@vger.kernel.org>

Allow a module implementing a layout type to register, and
have its mount/umount routines called for filesystems that
the server declares support it.
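
The registration logic amounts to a small id-keyed registry that
refuses duplicates. A simplified userspace sketch, with several
assumptions: struct layout_driver and the three functions are
hypothetical names, a hand-rolled singly linked list replaces
list_head, and the spinlock is omitted, so this version is not safe
for concurrent callers.

```c
#include <errno.h>
#include <stddef.h>

struct layout_driver {
	struct layout_driver *next;
	unsigned int id;
	const char *name;
};

static struct layout_driver *drivers;	/* head of the registered list */

/* Return the registered driver with the given id, or NULL. */
struct layout_driver *find_driver(unsigned int id)
{
	struct layout_driver *d;

	for (d = drivers; d; d = d->next)
		if (d->id == id)
			return d;
	return NULL;
}

/* Add drv to the list; fail with -EINVAL if its id is already taken. */
int register_driver(struct layout_driver *drv)
{
	if (find_driver(drv->id))
		return -EINVAL;
	drv->next = drivers;
	drivers = drv;
	return 0;
}

/* Unlink drv from the list; a driver that is not listed is ignored. */
void unregister_driver(struct layout_driver *drv)
{
	struct layout_driver **pp;

	for (pp = &drivers; *pp; pp = &(*pp)->next)
		if (*pp == drv) {
			*pp = drv->next;
			return;
		}
}
```

In the kernel version the lookup and the insertion both happen under
pnfs_spinlock, which is what keeps two modules from registering the
same id at once.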

Signed-off-by: TBD - melding/reorganization of several patches
---
 Documentation/filesystems/nfs/00-INDEX |    2 +
 Documentation/filesystems/nfs/pnfs.txt |   48 +++++++++++++++++++
 fs/nfs/Kconfig                         |    2 +-
 fs/nfs/pnfs.c                          |   79 +++++++++++++++++++++++++++++++-
 fs/nfs/pnfs.h                          |   14 ++++++
 5 files changed, 142 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/filesystems/nfs/pnfs.txt

diff --git a/Documentation/filesystems/nfs/00-INDEX b/Documentation/filesystems/nfs/00-INDEX
index 2f68cd6..8d930b9 100644
--- a/Documentation/filesystems/nfs/00-INDEX
+++ b/Documentation/filesystems/nfs/00-INDEX
@@ -12,5 +12,7 @@ nfs-rdma.txt
 	- how to install and setup the Linux NFS/RDMA client and server software
 nfsroot.txt
 	- short guide on setting up a diskless box with NFS root filesystem.
+pnfs.txt
+	- short explanation of some of the internals of the pnfs code
 rpc-cache.txt
 	- introduction to the caching mechanisms in the sunrpc layer.
diff --git a/Documentation/filesystems/nfs/pnfs.txt b/Documentation/filesystems/nfs/pnfs.txt
new file mode 100644
index 0000000..bc0b9cf
--- /dev/null
+++ b/Documentation/filesystems/nfs/pnfs.txt
@@ -0,0 +1,48 @@
+Reference counting in pnfs:
+==========================
+
+There are several inter-related caches.  We have layouts which can
+reference multiple devices, each of which can reference multiple data servers.
+Each data server can be referenced by multiple devices.  Each device
+can be referenced by multiple layouts.  To keep all of this straight,
+we need to reference count.
+
+
+struct pnfs_layout_hdr
+----------------------
+The on-the-wire command LAYOUTGET corresponds to struct
+pnfs_layout_segment, usually referred to by the variable name lseg.
+Each nfs_inode may hold a pointer to a cache of these layout
+segments in nfsi->layout, of type struct pnfs_layout_hdr.
+
+We reference the header for the inode pointing to it, across each
+outstanding RPC call that references it (LAYOUTGET, LAYOUTRETURN,
+LAYOUTCOMMIT), and for each lseg held within.
+
+Each header is also (when non-empty) put on a list associated with
+struct nfs_client (cl_layouts).  Being put on this list does not bump
+the reference count, as the layout is kept around by the lseg that
+keeps it in the list.
+
+deviceid_cache
+--------------
+lsegs reference device ids, which are resolved per nfs_client and
+layout driver type.  The device ids are held in a RCU cache (struct
+nfs4_deviceid_cache).  The cache itself is referenced across each
+mount.  The entries (struct nfs4_deviceid) themselves are held across
+the lifetime of each lseg referencing them.
+
+RCU is used because the deviceid is basically a write once, read many
+data structure.  The hlist size of 32 buckets needs better
+justification, but seems reasonable given that we can have multiple
+deviceid's per filesystem, and multiple filesystems per nfs_client.
+
+The hash code is copied from the nfsd code base.  A discussion of
+hashing and variations of this algorithm can be found at:
+http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809
+
+data server cache
+-----------------
+file driver devices refer to data servers, which are kept in a module
+level cache.  Its reference is held over the lifetime of the deviceid
+pointing to it.
diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig
index 6c2aad4..5f1b936 100644
--- a/fs/nfs/Kconfig
+++ b/fs/nfs/Kconfig
@@ -78,7 +78,7 @@ config NFS_V4_1
 	depends on NFS_V4 && EXPERIMENTAL
 	help
 	  This option enables support for minor version 1 of the NFSv4 protocol
-	  (draft-ietf-nfsv4-minorversion1) in the kernel's NFS client.
+	  (RFC 5661) in the kernel's NFS client.
 
 	  If unsure, say N.
 
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 2e5dba1..8d503fc 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -32,16 +32,48 @@
 
 #define NFSDBG_FACILITY		NFSDBG_PNFS
 
-/* STUB that returns the equivalent of "no module found" */
+/* Locking:
+ *
+ * pnfs_spinlock:
+ *      protects pnfs_modules_tbl.
+ */
+static DEFINE_SPINLOCK(pnfs_spinlock);
+
+/*
+ * pnfs_modules_tbl holds all pnfs modules
+ */
+static LIST_HEAD(pnfs_modules_tbl);
+
+/* Return the registered pnfs layout driver module matching given id */
+static struct pnfs_layoutdriver_type *
+find_pnfs_driver_locked(u32 id) {
+	struct  pnfs_layoutdriver_type *local;
+
+	dprintk("PNFS: %s: Searching for %u\n", __func__, id);
+	list_for_each_entry(local, &pnfs_modules_tbl, pnfs_tblid)
+		if (local->id == id)
+			goto out;
+	local = NULL;
+out:
+	return local;
+}
+
 static struct pnfs_layoutdriver_type *
 find_pnfs_driver(u32 id) {
-	return NULL;
+	struct  pnfs_layoutdriver_type *local;
+
+	spin_lock(&pnfs_spinlock);
+	local = find_pnfs_driver_locked(id);
+	spin_unlock(&pnfs_spinlock);
+	return local;
 }
 
 /* Unitialize a mountpoint in a layout driver */
 void
 unset_pnfs_layoutdriver(struct nfs_server *nfss)
 {
+	if (nfss->pnfs_curr_ld)
+		nfss->pnfs_curr_ld->ld_io_ops->uninitialize_mountpoint(nfss->nfs_client);
 	nfss->pnfs_curr_ld = NULL;
 }
 
@@ -68,6 +100,12 @@ set_pnfs_layoutdriver(struct nfs_server *server, u32 id)
 			goto out_no_driver;
 		}
 	}
+	if (ld_type->ld_io_ops->initialize_mountpoint(server->nfs_client)) {
+		printk(KERN_ERR
+		       "%s: Error initializing mount point for layout driver %u.\n",
+		       __func__, id);
+		goto out_no_driver;
+	}
 	server->pnfs_curr_ld = ld_type;
 	dprintk("%s: pNFS module for %u set\n", __func__, id);
 	return;
@@ -76,3 +114,40 @@ out_no_driver:
 	dprintk("%s: Using NFSv4 I/O\n", __func__);
 	server->pnfs_curr_ld = NULL;
 }
+
+int
+pnfs_register_layoutdriver(struct pnfs_layoutdriver_type *ld_type)
+{
+	struct layoutdriver_io_operations *io_ops = ld_type->ld_io_ops;
+	int status = -EINVAL;
+
+	if (!io_ops) {
+		printk(KERN_ERR "%s Layout driver must provide io_ops\n",
+			__func__);
+		return status;
+	}
+
+	spin_lock(&pnfs_spinlock);
+	if (!find_pnfs_driver_locked(ld_type->id)) {
+		list_add(&ld_type->pnfs_tblid, &pnfs_modules_tbl);
+		status = 0;
+		dprintk("%s Registering id:%u name:%s\n", __func__, ld_type->id,
+			ld_type->name);
+	} else
+		printk(KERN_ERR "%s Module with id %d already loaded!\n",
+			__func__, ld_type->id);
+	spin_unlock(&pnfs_spinlock);
+
+	return status;
+}
+EXPORT_SYMBOL(pnfs_register_layoutdriver);
+
+void
+pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *ld_type)
+{
+	dprintk("%s Deregistering id:%u\n", __func__, ld_type->id);
+	spin_lock(&pnfs_spinlock);
+	list_del(&ld_type->pnfs_tblid);
+	spin_unlock(&pnfs_spinlock);
+}
+EXPORT_SYMBOL(pnfs_unregister_layoutdriver);
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 3281fbf..9049b9a 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -16,8 +16,22 @@
 
 /* Per-layout driver specific registration structure */
 struct pnfs_layoutdriver_type {
+	struct list_head pnfs_tblid;
+	const u32 id;
+	const char *name;
+	struct layoutdriver_io_operations *ld_io_ops;
 };
 
+/* Layout driver I/O operations. */
+struct layoutdriver_io_operations {
+	/* Registration information for a new mounted file system */
+	int (*initialize_mountpoint) (struct nfs_client *);
+	int (*uninitialize_mountpoint) (struct nfs_client *);
+};
+
+extern int pnfs_register_layoutdriver(struct pnfs_layoutdriver_type *);
+extern void pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *);
+
 void set_pnfs_layoutdriver(struct nfs_server *, u32 id);
 void unset_pnfs_layoutdriver(struct nfs_server *);
 
-- 
1.7.2.1



* [PATCH 08/13] RFC: pnfs: filelayout: introduce minimal file layout driver
  2010-09-02 18:00 [PATCH 00/13] RFC: pnfs: LAYOUTGET/DEVINFO submission Fred Isaman
                   ` (6 preceding siblings ...)
  2010-09-02 18:00 ` [PATCH 07/13] RFC: pnfs: full mount/umount infrastructure Fred Isaman
@ 2010-09-02 18:00 ` Fred Isaman
  2010-09-10 19:31   ` Trond Myklebust
  2010-09-13 15:08   ` Christoph Hellwig
  2010-09-02 18:00 ` [PATCH 09/13] RFC: nfs: create and destroy inode's layout cache Fred Isaman
                   ` (4 subsequent siblings)
  12 siblings, 2 replies; 55+ messages in thread
From: Fred Isaman @ 2010-09-02 18:00 UTC (permalink / raw)
  To: linux-nfs

From: The pNFS Team <linux-nfs@vger.kernel.org>

This driver just registers itself and supplies trivial mount/umount functions.

Signed-off-by: TBD - melding/reorganization of several patches
---
 fs/nfs/Kconfig          |    5 +++
 fs/nfs/Makefile         |    3 ++
 fs/nfs/nfs4filelayout.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/nfs_fs.h  |    1 +
 4 files changed, 98 insertions(+), 0 deletions(-)
 create mode 100644 fs/nfs/nfs4filelayout.c

diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig
index 5f1b936..980f2dc 100644
--- a/fs/nfs/Kconfig
+++ b/fs/nfs/Kconfig
@@ -82,6 +82,11 @@ config NFS_V4_1
 
 	  If unsure, say N.
 
+config PNFS_FILE_LAYOUT
+	tristate
+	depends on NFS_FS && NFS_V4_1
+	default m
+
 config ROOT_NFS
 	bool "Root file system on NFS"
 	depends on NFS_FS=y && IP_PNP
diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index bb9e773..08a8889 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -18,3 +18,6 @@ nfs-$(CONFIG_NFS_V4)	+= nfs4proc.o nfs4xdr.o nfs4state.o nfs4renewd.o \
 nfs-$(CONFIG_NFS_V4_1)	+= pnfs.o
 nfs-$(CONFIG_SYSCTL) += sysctl.o
 nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-index.o
+
+obj-$(CONFIG_PNFS_FILE_LAYOUT) += nfs_layout_nfsv41_files.o
+nfs_layout_nfsv41_files-y := nfs4filelayout.o
diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
new file mode 100644
index 0000000..c685196
--- /dev/null
+++ b/fs/nfs/nfs4filelayout.c
@@ -0,0 +1,89 @@
+/*
+ *  Module for the pnfs nfs4 file layout driver.
+ *  Defines all I/O and Policy interface operations, plus code
+ *  to register itself with the pNFS client.
+ *
+ *  Copyright (c) 2002
+ *  The Regents of the University of Michigan
+ *  All Rights Reserved
+ *
+ *  Dean Hildebrand <dhildebz@umich.edu>
+ *
+ *  Permission is granted to use, copy, create derivative works, and
+ *  redistribute this software and such derivative works for any purpose,
+ *  so long as the name of the University of Michigan is not used in
+ *  any advertising or publicity pertaining to the use or distribution
+ *  of this software without specific, written prior authorization. If
+ *  the above copyright notice or any other identification of the
+ *  University of Michigan is included in any copy of any portion of
+ *  this software, then the disclaimer below must also be included.
+ *
+ *  This software is provided as is, without representation or warranty
+ *  of any kind either express or implied, including without limitation
+ *  the implied warranties of merchantability, fitness for a particular
+ *  purpose, or noninfringement.  The Regents of the University of
+ *  Michigan shall not be liable for any damages, including special,
+ *  indirect, incidental, or consequential damages, with respect to any
+ *  claim arising out of or in connection with the use of the software,
+ *  even if it has been or is hereafter advised of the possibility of
+ *  such damages.
+ */
+
+#include <linux/nfs_fs.h>
+#include "pnfs.h"
+
+#define NFSDBG_FACILITY         NFSDBG_PNFS_LD
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Dean Hildebrand <dhildebz@umich.edu>");
+MODULE_DESCRIPTION("The NFSv4 file layout driver");
+
+int
+filelayout_initialize_mountpoint(struct nfs_client *clp)
+{
+	return 0;
+}
+
+int
+filelayout_uninitialize_mountpoint(struct nfs_client *clp)
+{
+	dprintk("--> %s\n", __func__);
+
+	return 0;
+}
+
+struct layoutdriver_io_operations filelayout_io_operations = {
+	.initialize_mountpoint   = filelayout_initialize_mountpoint,
+	.uninitialize_mountpoint = filelayout_uninitialize_mountpoint,
+};
+
+
+struct pnfs_layoutdriver_type filelayout_type = {
+	.id = LAYOUT_NFSV4_1_FILES,
+	.name = "LAYOUT_NFSV4_1_FILES",
+	.ld_io_ops = &filelayout_io_operations,
+};
+
+static int __init nfs4filelayout_init(void)
+{
+	printk(KERN_INFO "%s: NFSv4 File Layout Driver Registering...\n",
+	       __func__);
+
+	/*
+	 * Need to register the layout driver type with the global list to
+	 * indicate that the NFSv4 file layout is a possible pNFS I/O module
+	 */
+	return pnfs_register_layoutdriver(&filelayout_type);
+}
+
+static void __exit nfs4filelayout_exit(void)
+{
+	printk(KERN_INFO "%s: NFSv4 File Layout Driver Unregistering...\n",
+	       __func__);
+
+	/* Unregister NFS4 file layout driver from pNFS client */
+	pnfs_unregister_layoutdriver(&filelayout_type);
+}
+
+module_init(nfs4filelayout_init);
+module_exit(nfs4filelayout_exit);
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 042c2bd..a0f49a3 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -614,6 +614,7 @@ extern void * nfs_root_data(void);
 #define NFSDBG_MOUNT		0x0400
 #define NFSDBG_FSCACHE		0x0800
 #define NFSDBG_PNFS		0x1000
+#define NFSDBG_PNFS_LD		0x2000
 #define NFSDBG_ALL		0xFFFF
 
 #ifdef __KERNEL__
-- 
1.7.2.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 09/13] RFC: nfs: create and destroy inode's layout cache
  2010-09-02 18:00 [PATCH 00/13] RFC: pnfs: LAYOUTGET/DEVINFO submission Fred Isaman
                   ` (7 preceding siblings ...)
  2010-09-02 18:00 ` [PATCH 08/13] RFC: pnfs: filelayout: introduce minimal file layout driver Fred Isaman
@ 2010-09-02 18:00 ` Fred Isaman
  2010-09-10 19:43   ` Trond Myklebust
  2010-09-02 18:00 ` [PATCH 10/13] RFC: nfs: client needs to maintain list of inodes with active layouts Fred Isaman
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 55+ messages in thread
From: Fred Isaman @ 2010-09-02 18:00 UTC (permalink / raw)
  To: linux-nfs

From: The pNFS Team <linux-nfs@vger.kernel.org>

At the start of the io paths, try to grab the relevant layout
information.  This will allocate the inode's layout cache, but
stubs ensure the cache stays empty.
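
For illustration only, the find-or-allocate step can be sketched in
userspace as below.  This is not the kernel code: struct fake_inode,
struct layout_hdr and the pthread mutex are hypothetical stand-ins for
nfs_inode, pnfs_layout_hdr and i_lock.  The point is the pattern of
dropping the lock around the allocation and rechecking the pointer
afterwards, so that a racing allocator loses cleanly.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct pnfs_layout_hdr. */
struct layout_hdr {
	int refcount;
};

/* Hypothetical userspace stand-in for struct nfs_inode. */
struct fake_inode {
	pthread_mutex_t lock;		/* plays the role of inode->i_lock */
	struct layout_hdr *layout;	/* NULL until the first I/O */
};

/* Caller holds ino->lock; it is dropped and retaken around the allocation. */
static struct layout_hdr *find_alloc_layout(struct fake_inode *ino)
{
	struct layout_hdr *new;

	if (ino->layout)
		return ino->layout;

	pthread_mutex_unlock(&ino->lock);
	new = calloc(1, sizeof(*new));
	if (new)
		new->refcount = 1;	/* reference owned by the inode */
	pthread_mutex_lock(&ino->lock);

	if (ino->layout == NULL)	/* won the race (or allocation failed) */
		ino->layout = new;
	else
		free(new);		/* lost the race; free(NULL) is a no-op */
	return ino->layout;
}

/* Drops the inode's own reference, as inode eviction does. */
static void destroy_layout(struct fake_inode *ino)
{
	pthread_mutex_lock(&ino->lock);
	if (ino->layout) {
		assert(ino->layout->refcount > 0);
		if (--ino->layout->refcount == 0) {
			free(ino->layout);
			ino->layout = NULL;
		}
	}
	pthread_mutex_unlock(&ino->lock);
}
```

Note that find_alloc_layout() can still return NULL when the allocation
fails and nobody else installed a header, so callers have to check the
result before touching it.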

Signed-off-by: TBD - melding/reorganization of several patches
---
 fs/nfs/file.c          |    5 ++
 fs/nfs/inode.c         |    3 +
 fs/nfs/pnfs.c          |  140 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nfs/pnfs.h          |   39 +++++++++++++
 fs/nfs/read.c          |    3 +
 include/linux/nfs_fs.h |    3 +
 6 files changed, 193 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index eb51bd6..10ebdfb 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -36,6 +36,7 @@
 #include "internal.h"
 #include "iostat.h"
 #include "fscache.h"
+#include "pnfs.h"
 
 #define NFSDBG_FACILITY		NFSDBG_FILE
 
@@ -386,6 +387,10 @@ static int nfs_write_begin(struct file *file, struct address_space *mapping,
 		file->f_path.dentry->d_name.name,
 		mapping->host->i_ino, len, (long long) pos);
 
+	pnfs_update_layout(mapping->host,
+			   nfs_file_open_context(file),
+			   IOMODE_RW);
+
 start:
 	/*
 	 * Prevent starvation issues if someone is doing a consistency
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 7d2d6c7..0dc6dad 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -48,6 +48,7 @@
 #include "internal.h"
 #include "fscache.h"
 #include "dns_resolve.h"
+#include "pnfs.h"
 
 #define NFSDBG_FACILITY		NFSDBG_VFS
 
@@ -1409,6 +1410,7 @@ void nfs4_evict_inode(struct inode *inode)
 {
 	truncate_inode_pages(&inode->i_data, 0);
 	end_writeback(inode);
+	pnfs_destroy_layout(NFS_I(inode));
 	/* If we are holding a delegation, return it! */
 	nfs_inode_return_delegation_noreclaim(inode);
 	/* First call standard NFS clear_inode() code */
@@ -1446,6 +1448,7 @@ static inline void nfs4_init_once(struct nfs_inode *nfsi)
 	nfsi->delegation = NULL;
 	nfsi->delegation_state = 0;
 	init_rwsem(&nfsi->rwsem);
+	nfsi->layout = NULL;
 #endif
 }
 
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 8d503fc..65f923b 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -151,3 +151,143 @@ pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *ld_type)
 	spin_unlock(&pnfs_spinlock);
 }
 EXPORT_SYMBOL(pnfs_unregister_layoutdriver);
+
+static void
+get_layout_hdr_locked(struct pnfs_layout_hdr *lo)
+{
+	assert_spin_locked(&lo->inode->i_lock);
+	lo->refcount++;
+}
+
+static void
+put_layout_hdr_locked(struct pnfs_layout_hdr *lo)
+{
+	assert_spin_locked(&lo->inode->i_lock);
+	BUG_ON(lo->refcount <= 0);
+
+	lo->refcount--;
+	if (!lo->refcount) {
+		dprintk("%s: freeing layout cache %p\n", __func__, lo);
+		NFS_I(lo->inode)->layout = NULL;
+		kfree(lo);
+	}
+}
+
+void
+pnfs_destroy_layout(struct nfs_inode *nfsi)
+{
+	struct pnfs_layout_hdr *lo;
+
+	spin_lock(&nfsi->vfs_inode.i_lock);
+	lo = nfsi->layout;
+	if (lo) {
+		/* Matched by refcount set to 1 in alloc_init_layout_hdr */
+		put_layout_hdr_locked(lo);
+	}
+	spin_unlock(&nfsi->vfs_inode.i_lock);
+}
+
+/* STUB - pretend LAYOUTGET to server failed */
+static struct pnfs_layout_segment *
+send_layoutget(struct pnfs_layout_hdr *lo,
+	   struct nfs_open_context *ctx,
+	   u32 iomode)
+{
+	struct inode *ino = lo->inode;
+
+	set_bit(lo_fail_bit(iomode), &lo->state);
+	spin_lock(&ino->i_lock);
+	put_layout_hdr_locked(lo);
+	spin_unlock(&ino->i_lock);
+	return NULL;
+}
+
+static struct pnfs_layout_hdr *
+alloc_init_layout_hdr(struct inode *ino)
+{
+	struct pnfs_layout_hdr *lo;
+
+	lo = kzalloc(sizeof(struct pnfs_layout_hdr), GFP_KERNEL);
+	if (!lo)
+		return NULL;
+	lo->refcount = 1;
+	lo->inode = ino;
+	return lo;
+}
+
+static struct pnfs_layout_hdr *
+pnfs_find_alloc_layout(struct inode *ino)
+{
+	struct nfs_inode *nfsi = NFS_I(ino);
+	struct pnfs_layout_hdr *new = NULL;
+
+	dprintk("%s Begin ino=%p layout=%p\n", __func__, ino, nfsi->layout);
+
+	assert_spin_locked(&ino->i_lock);
+	if (nfsi->layout)
+		return nfsi->layout;
+
+	spin_unlock(&ino->i_lock);
+	new = alloc_init_layout_hdr(ino);
+	spin_lock(&ino->i_lock);
+
+	if (likely(nfsi->layout == NULL))	/* Won the race? */
+		nfsi->layout = new;
+	else
+		kfree(new);
+	return nfsi->layout;
+}
+
+/* STUB - LAYOUTGET never succeeds, so cache is empty */
+static struct pnfs_layout_segment *
+pnfs_has_layout(struct pnfs_layout_hdr *lo, u32 iomode)
+{
+	return NULL;
+}
+
+/*
+ * Layout segment is retrieved from the server if not cached.
+ * The appropriate layout segment is referenced and returned to the caller.
+ */
+struct pnfs_layout_segment *
+pnfs_update_layout(struct inode *ino,
+		   struct nfs_open_context *ctx,
+		   enum pnfs_iomode iomode)
+{
+	struct nfs_inode *nfsi = NFS_I(ino);
+	struct pnfs_layout_hdr *lo;
+	struct pnfs_layout_segment *lseg = NULL;
+
+	if (!pnfs_enabled_sb(NFS_SERVER(ino)))
+		return NULL;
+	spin_lock(&ino->i_lock);
+	lo = pnfs_find_alloc_layout(ino);
+	if (lo == NULL) {
+		dprintk("%s ERROR: can't get pnfs_layout_hdr\n", __func__);
+		goto out_unlock;
+	}
+
+	/* Check to see if the layout for the given range already exists */
+	lseg = pnfs_has_layout(lo, iomode);
+	if (lseg) {
+		dprintk("%s: Using cached lseg %p for iomode %d\n",
+			__func__, lseg, iomode);
+		goto out_unlock;
+	}
+
+	/* if LAYOUTGET already failed once we don't try again */
+	if (test_bit(lo_fail_bit(iomode), &nfsi->layout->state))
+		goto out_unlock;
+
+	get_layout_hdr_locked(lo);
+	spin_unlock(&ino->i_lock);
+
+	lseg = send_layoutget(lo, ctx, iomode);
+out:
+	dprintk("%s end, state 0x%lx lseg %p\n", __func__,
+		nfsi->layout ? nfsi->layout->state : 0UL, lseg);
+	return lseg;
+out_unlock:
+	spin_unlock(&ino->i_lock);
+	goto out;
+}
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 9049b9a..b63b445 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -14,6 +14,11 @@
 
 #define LAYOUT_NFSV4_1_MODULE_PREFIX "nfs-layouttype4"
 
+enum {
+	NFS_LAYOUT_RO_FAILED = 0,	/* get ro layout failed stop trying */
+	NFS_LAYOUT_RW_FAILED,		/* get rw layout failed stop trying */
+};
+
 /* Per-layout driver specific registration structure */
 struct pnfs_layoutdriver_type {
 	struct list_head pnfs_tblid;
@@ -22,6 +27,12 @@ struct pnfs_layoutdriver_type {
 	struct layoutdriver_io_operations *ld_io_ops;
 };
 
+struct pnfs_layout_hdr {
+	int			refcount;
+	unsigned long		state;
+	struct inode		*inode;
+};
+
 /* Layout driver I/O operations. */
 struct layoutdriver_io_operations {
 	/* Registration information for a new mounted file system */
@@ -32,11 +43,39 @@ struct layoutdriver_io_operations {
 extern int pnfs_register_layoutdriver(struct pnfs_layoutdriver_type *);
 extern void pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *);
 
+struct pnfs_layout_segment *
+pnfs_update_layout(struct inode *ino, struct nfs_open_context *ctx,
+		   enum pnfs_iomode access_type);
 void set_pnfs_layoutdriver(struct nfs_server *, u32 id);
 void unset_pnfs_layoutdriver(struct nfs_server *);
+void pnfs_destroy_layout(struct nfs_inode *);
+
+
+static inline int lo_fail_bit(u32 iomode)
+{
+	return iomode == IOMODE_RW ?
+			 NFS_LAYOUT_RW_FAILED : NFS_LAYOUT_RO_FAILED;
+}
+
+/* Return true if a layout driver is being used for this mountpoint */
+static inline int pnfs_enabled_sb(struct nfs_server *nfss)
+{
+	return nfss->pnfs_curr_ld != NULL;
+}
 
 #else  /* CONFIG_NFS_V4_1 */
 
+static inline void pnfs_destroy_layout(struct nfs_inode *nfsi)
+{
+}
+
+static inline struct pnfs_layout_segment *
+pnfs_update_layout(struct inode *ino, struct nfs_open_context *ctx,
+		   enum pnfs_iomode access_type)
+{
+	return NULL;
+}
+
 static inline void set_pnfs_layoutdriver(struct nfs_server *s, u32 id)
 {
 }
diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index 87adc27..f7eb66f 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -25,6 +25,7 @@
 #include "internal.h"
 #include "iostat.h"
 #include "fscache.h"
+#include "pnfs.h"
 
 #define NFSDBG_FACILITY		NFSDBG_PAGECACHE
 
@@ -121,6 +122,7 @@ int nfs_readpage_async(struct nfs_open_context *ctx, struct inode *inode,
 	len = nfs_page_length(page);
 	if (len == 0)
 		return nfs_return_empty_page(page);
+	pnfs_update_layout(inode, ctx, IOMODE_READ);
 	new = nfs_create_request(ctx, inode, page, 0, len);
 	if (IS_ERR(new)) {
 		unlock_page(page);
@@ -625,6 +627,7 @@ int nfs_readpages(struct file *filp, struct address_space *mapping,
 	if (ret == 0)
 		goto read_complete; /* all pages were read */
 
+	pnfs_update_layout(inode, desc.ctx, IOMODE_READ);
 	if (rsize < PAGE_CACHE_SIZE)
 		nfs_pageio_init(&pgio, inode, nfs_pagein_multi, rsize, 0);
 	else
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index a0f49a3..ebd87a9 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -188,6 +188,9 @@ struct nfs_inode {
 	struct nfs_delegation	*delegation;
 	fmode_t			 delegation_state;
 	struct rw_semaphore	rwsem;
+
+	/* pNFS layout information */
+	struct pnfs_layout_hdr *layout;
 #endif /* CONFIG_NFS_V4*/
 #ifdef CONFIG_NFS_FSCACHE
 	struct fscache_cookie	*fscache;
-- 
1.7.2.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 10/13] RFC: nfs: client needs to maintain list of inodes with active layouts
  2010-09-02 18:00 [PATCH 00/13] RFC: pnfs: LAYOUTGET/DEVINFO submission Fred Isaman
                   ` (8 preceding siblings ...)
  2010-09-02 18:00 ` [PATCH 09/13] RFC: nfs: create and destroy inode's layout cache Fred Isaman
@ 2010-09-02 18:00 ` Fred Isaman
  2010-09-10 19:59   ` Trond Myklebust
  2010-09-02 18:00 ` [PATCH 11/13] RFC: nfs: retry on certain pnfs errors Fred Isaman
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 55+ messages in thread
From: Fred Isaman @ 2010-09-02 18:00 UTC (permalink / raw)
  To: linux-nfs

From: The pNFS Team <linux-nfs@vger.kernel.org>

In particular, server reboot will invalidate all layouts.

Note that in order to have an active layout, we must get a successful response
from the server.  To avoid adding that machinery, this patch just includes a
stub that fakes up a successful return.  Since the layout is never referenced
for io, this is not a problem.
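
As a sketch of the teardown on server reboot (userspace only, with a
hand-rolled list standing in for the kernel's list_head; struct
toy_layout and all helper names here are hypothetical), the client list
is first moved wholesale onto a private list, and each entry is then
unlinked as it is destroyed, which is what guarantees the loop makes
progress:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modeled on the kernel's list_head. */
struct list_node {
	struct list_node *next, *prev;
};

static void list_init(struct list_node *n)
{
	n->next = n->prev = n;
}

static int list_is_empty(const struct list_node *head)
{
	return head->next == head;
}

static void list_append(struct list_node *n, struct list_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Unlink n and leave it self-pointing so a later emptiness check works. */
static void list_remove_init(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

/* Move every entry of src onto the (empty) list dst, leaving src empty. */
static void list_move_all(struct list_node *src, struct list_node *dst)
{
	assert(list_is_empty(dst));
	if (list_is_empty(src))
		return;
	dst->next = src->next;
	dst->prev = src->prev;
	dst->next->prev = dst;
	dst->prev->next = dst;
	list_init(src);
}

/* Hypothetical stand-in for pnfs_layout_hdr: just the client-list linkage. */
struct toy_layout {
	struct list_node layouts;
	int destroyed;
};

/* Tear down every layout on the client's list; returns how many were hit. */
static int destroy_all_layouts(struct list_node *cl_layouts)
{
	struct list_node tmp;
	int count = 0;

	list_init(&tmp);
	/* In the kernel this step would run under clp->cl_lock. */
	list_move_all(cl_layouts, &tmp);

	while (!list_is_empty(&tmp)) {
		struct toy_layout *lo = (struct toy_layout *)
			((char *)tmp.next - offsetof(struct toy_layout, layouts));

		list_remove_init(&lo->layouts);	/* guarantees loop progress */
		lo->destroyed = 1;
		count++;
	}
	return count;
}
```

Splicing first keeps the time spent holding the client-wide lock short,
at the cost of the entries being briefly reachable only through the
private list.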

Signed-off-by: TBD - melding/reorganization of several patches
---
 fs/nfs/client.c           |    4 +-
 fs/nfs/nfs4state.c        |    2 +
 fs/nfs/pnfs.c             |  130 +++++++++++++++++++++++++++++++++++++++++++-
 fs/nfs/pnfs.h             |   20 +++++++
 include/linux/nfs_fs_sb.h |    1 +
 5 files changed, 153 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index eed1212..6fc5c84 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -156,7 +156,9 @@ static struct nfs_client *nfs_alloc_client(const struct nfs_client_initdata *cl_
 	cred = rpc_lookup_machine_cred();
 	if (!IS_ERR(cred))
 		clp->cl_machine_cred = cred;
-
+#if defined(CONFIG_NFS_V4_1)
+	INIT_LIST_HEAD(&clp->cl_layouts);
+#endif
 	nfs_fscache_get_client_cookie(clp);
 
 	return clp;
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 3e2f19b..b53a4ce 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -53,6 +53,7 @@
 #include "callback.h"
 #include "delegation.h"
 #include "internal.h"
+#include "pnfs.h"
 
 #define OPENOWNER_POOL_SIZE	8
 
@@ -1447,6 +1448,7 @@ static void nfs4_state_manager(struct nfs_client *clp)
 			}
 			clear_bit(NFS4CLNT_CHECK_LEASE, &clp->cl_state);
 			set_bit(NFS4CLNT_RECLAIM_REBOOT, &clp->cl_state);
+			pnfs_destroy_all_layouts(clp);
 		}
 
 		if (test_and_clear_bit(NFS4CLNT_CHECK_LEASE, &clp->cl_state)) {
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 65f923b..cbce942 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -28,6 +28,7 @@
  */
 
 #include <linux/nfs_fs.h>
+#include "internal.h"
 #include "pnfs.h"
 
 #define NFSDBG_FACILITY		NFSDBG_PNFS
@@ -168,11 +169,67 @@ put_layout_hdr_locked(struct pnfs_layout_hdr *lo)
 	lo->refcount--;
 	if (!lo->refcount) {
 		dprintk("%s: freeing layout cache %p\n", __func__, lo);
+		BUG_ON(!list_empty(&lo->layouts));
 		NFS_I(lo->inode)->layout = NULL;
 		kfree(lo);
 	}
 }
 
+static void
+init_lseg(struct pnfs_layout_hdr *lo, struct pnfs_layout_segment *lseg)
+{
+	INIT_LIST_HEAD(&lseg->fi_list);
+	kref_init(&lseg->kref);
+	lseg->layout = lo;
+}
+
+static void
+destroy_lseg(struct kref *kref)
+{
+	struct pnfs_layout_segment *lseg =
+		container_of(kref, struct pnfs_layout_segment, kref);
+	struct pnfs_layout_hdr *local = lseg->layout;
+
+	dprintk("--> %s\n", __func__);
+	kfree(lseg);
+	/* Matched by get_layout_hdr_locked in pnfs_insert_layout */
+	put_layout_hdr_locked(local);
+}
+
+static void
+put_lseg_locked(struct pnfs_layout_segment *lseg)
+{
+	if (!lseg)
+		return;
+
+	dprintk("%s: lseg %p ref %d\n", __func__, lseg,
+		atomic_read(&lseg->kref.refcount));
+	kref_put(&lseg->kref, destroy_lseg);
+}
+
+static void
+pnfs_clear_lseg_list(struct pnfs_layout_hdr *lo)
+{
+	struct pnfs_layout_segment *lseg, *next;
+	struct nfs_client *clp;
+
+	dprintk("%s:Begin lo %p\n", __func__, lo);
+
+	assert_spin_locked(&lo->inode->i_lock);
+	list_for_each_entry_safe(lseg, next, &lo->segs, fi_list) {
+		dprintk("%s: freeing lseg %p\n", __func__, lseg);
+		list_del(&lseg->fi_list);
+		put_lseg_locked(lseg);
+	}
+	clp = PNFS_NFS_SERVER(lo)->nfs_client;
+	spin_lock(&clp->cl_lock);
+	/* List does not take a reference, so no need for put here */
+	list_del_init(&lo->layouts);
+	spin_unlock(&clp->cl_lock);
+
+	dprintk("%s:Return\n", __func__);
+}
+
 void
 pnfs_destroy_layout(struct nfs_inode *nfsi)
 {
@@ -181,25 +238,90 @@ pnfs_destroy_layout(struct nfs_inode *nfsi)
 	spin_lock(&nfsi->vfs_inode.i_lock);
 	lo = nfsi->layout;
 	if (lo) {
+		pnfs_clear_lseg_list(lo);
 		/* Matched by refcount set to 1 in alloc_init_layout_hdr */
 		put_layout_hdr_locked(lo);
 	}
 	spin_unlock(&nfsi->vfs_inode.i_lock);
 }
 
-/* STUB - pretend LAYOUTGET to server failed */
+/*
+ * Called by the state manager to remove all layouts established under an
+ * expired lease.
+ */
+void
+pnfs_destroy_all_layouts(struct nfs_client *clp)
+{
+	struct pnfs_layout_hdr *lo;
+	LIST_HEAD(tmp_list);
+
+	spin_lock(&clp->cl_lock);
+	list_splice_init(&clp->cl_layouts, &tmp_list);
+	spin_unlock(&clp->cl_lock);
+
+	while (!list_empty(&tmp_list)) {
+		lo = list_entry(tmp_list.next, struct pnfs_layout_hdr,
+				layouts);
+		dprintk("%s freeing layout for inode %lu\n", __func__,
+			lo->inode->i_ino);
+		pnfs_destroy_layout(NFS_I(lo->inode));
+	}
+}
+
+static void pnfs_insert_layout(struct pnfs_layout_hdr *lo,
+			       struct pnfs_layout_segment *lseg);
+
+/* Get layout from server. */
 static struct pnfs_layout_segment *
 send_layoutget(struct pnfs_layout_hdr *lo,
 	   struct nfs_open_context *ctx,
 	   u32 iomode)
 {
 	struct inode *ino = lo->inode;
+	struct pnfs_layout_segment *lseg;
 
-	set_bit(lo_fail_bit(iomode), &lo->state);
+	/* Let's pretend we sent LAYOUTGET and got a response */
+	lseg = kzalloc(sizeof(*lseg), GFP_KERNEL);
+	if (!lseg) {
+		set_bit(lo_fail_bit(iomode), &lo->state);
+		spin_lock(&ino->i_lock);
+		put_layout_hdr_locked(lo);
+		spin_unlock(&ino->i_lock);
+		return NULL;
+	}
+	init_lseg(lo, lseg);
+	lseg->iomode = IOMODE_RW;
 	spin_lock(&ino->i_lock);
+	pnfs_insert_layout(lo, lseg);
 	put_layout_hdr_locked(lo);
 	spin_unlock(&ino->i_lock);
-	return NULL;
+	return lseg;
+}
+
+static void
+pnfs_insert_layout(struct pnfs_layout_hdr *lo,
+		   struct pnfs_layout_segment *lseg)
+{
+	dprintk("%s:Begin\n", __func__);
+
+	assert_spin_locked(&lo->inode->i_lock);
+	if (list_empty(&lo->segs)) {
+		struct nfs_client *clp = PNFS_NFS_SERVER(lo)->nfs_client;
+
+		spin_lock(&clp->cl_lock);
+		BUG_ON(!list_empty(&lo->layouts));
+		list_add_tail(&lo->layouts, &clp->cl_layouts);
+		spin_unlock(&clp->cl_lock);
+	}
+	/* STUB - add the constructed lseg if necessary */
+	if (list_empty(&lo->segs)) {
+		list_add_tail(&lseg->fi_list, &lo->segs);
+		get_layout_hdr_locked(lo);
+		dprintk("%s: inserted lseg %p iomode %d at tail\n",
+			__func__, lseg, lseg->iomode);
+	}
+
+	dprintk("%s:Return\n", __func__);
 }
 
 static struct pnfs_layout_hdr *
@@ -211,6 +333,8 @@ alloc_init_layout_hdr(struct inode *ino)
 	if (!lo)
 		return NULL;
 	lo->refcount = 1;
+	INIT_LIST_HEAD(&lo->layouts);
+	INIT_LIST_HEAD(&lo->segs);
 	lo->inode = ino;
 	return lo;
 }
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index b63b445..dac6a72 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -10,6 +10,13 @@
 #ifndef FS_NFS_PNFS_H
 #define FS_NFS_PNFS_H
 
+struct pnfs_layout_segment {
+	struct list_head fi_list;
+	u32 iomode;
+	struct kref kref;
+	struct pnfs_layout_hdr *layout;
+};
+
 #ifdef CONFIG_NFS_V4_1
 
 #define LAYOUT_NFSV4_1_MODULE_PREFIX "nfs-layouttype4"
@@ -29,6 +36,8 @@ struct pnfs_layoutdriver_type {
 
 struct pnfs_layout_hdr {
 	int			refcount;
+	struct list_head	layouts;   /* other client layouts */
+	struct list_head	segs;      /* layout segments list */
 	unsigned long		state;
 	struct inode		*inode;
 };
@@ -43,12 +52,19 @@ struct layoutdriver_io_operations {
 extern int pnfs_register_layoutdriver(struct pnfs_layoutdriver_type *);
 extern void pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *);
 
+static inline struct nfs_server *
+PNFS_NFS_SERVER(struct pnfs_layout_hdr *lo)
+{
+	return NFS_SERVER(lo->inode);
+}
+
 struct pnfs_layout_segment *
 pnfs_update_layout(struct inode *ino, struct nfs_open_context *ctx,
 		   enum pnfs_iomode access_type);
 void set_pnfs_layoutdriver(struct nfs_server *, u32 id);
 void unset_pnfs_layoutdriver(struct nfs_server *);
 void pnfs_destroy_layout(struct nfs_inode *);
+void pnfs_destroy_all_layouts(struct nfs_client *);
 
 
 static inline int lo_fail_bit(u32 iomode)
@@ -65,6 +81,10 @@ static inline int pnfs_enabled_sb(struct nfs_server *nfss)
 
 #else  /* CONFIG_NFS_V4_1 */
 
+static inline void pnfs_destroy_all_layouts(struct nfs_client *clp)
+{
+}
+
 static inline void pnfs_destroy_layout(struct nfs_inode *nfsi)
 {
 }
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 29a821d..e670a9c 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -82,6 +82,7 @@ struct nfs_client {
 	/* The flags used for obtaining the clientid during EXCHANGE_ID */
 	u32			cl_exchange_flags;
 	struct nfs4_session	*cl_session; 	/* sharred session */
+	struct list_head	cl_layouts;
 #endif /* CONFIG_NFS_V4_1 */
 
 #ifdef CONFIG_NFS_FSCACHE
-- 
1.7.2.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 11/13] RFC: nfs: retry on certain pnfs errors
  2010-09-02 18:00 [PATCH 00/13] RFC: pnfs: LAYOUTGET/DEVINFO submission Fred Isaman
                   ` (9 preceding siblings ...)
  2010-09-02 18:00 ` [PATCH 10/13] RFC: nfs: client needs to maintain list of inodes with active layouts Fred Isaman
@ 2010-09-02 18:00 ` Fred Isaman
  2010-09-02 18:00 ` [PATCH 12/13] RFC: pnfs: add LAYOUTGET and GETDEVICEINFO infrastructure Fred Isaman
  2010-09-02 18:00 ` [PATCH 13/13] RFC: pnfs: filelayout: add driver's " Fred Isaman
  12 siblings, 0 replies; 55+ messages in thread
From: Fred Isaman @ 2010-09-02 18:00 UTC (permalink / raw)
  To: linux-nfs

From: The pNFS Team <linux-nfs@vger.kernel.org>

Add to the list of errors that cause the client to keep
retrying with exponential backoff.
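
A minimal sketch of what "retry with exponential backoff" means here,
written as standalone C rather than kernel code.  The error enum, struct
toy_exception and the millisecond bounds are assumptions made for the
example and are not taken from the patch; only the shape (transient
errors request a retry, and the wait doubles up to a cap) is intended to
match:

```c
#include <assert.h>

/* Illustrative bounds in milliseconds; the kernel's constants may differ. */
#define RETRY_MIN_MS	100
#define RETRY_MAX_MS	15000

/* Hypothetical error codes standing in for the NFS4ERR_* values. */
enum toy_err {
	TOY_OK = 0,
	TOY_DELAY,
	TOY_LAYOUTTRYLATER,
	TOY_RECALLCONFLICT,
	TOY_FATAL,
};

struct toy_exception {
	long timeout_ms;	/* delay to use for the next wait */
	int retry;		/* set when the caller should call again */
};

/* Returns the delay to wait now and doubles the stored timeout for later. */
static long backoff_delay(long *timeout_ms)
{
	long wait;

	if (*timeout_ms <= 0)
		*timeout_ms = RETRY_MIN_MS;
	if (*timeout_ms > RETRY_MAX_MS)
		*timeout_ms = RETRY_MAX_MS;
	wait = *timeout_ms;
	*timeout_ms <<= 1;
	return wait;
}

/* Classify an error: transient ones request a retry after a growing delay. */
static long handle_exception(enum toy_err err, struct toy_exception *exc)
{
	exc->retry = 0;
	switch (err) {
	case TOY_DELAY:
	case TOY_LAYOUTTRYLATER:
	case TOY_RECALLCONFLICT:
		exc->retry = 1;
		return backoff_delay(&exc->timeout_ms);
	default:
		return 0;
	}
}
```

A caller would loop while exc.retry is set, sleeping for the returned
delay between attempts.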

Signed-off-by: TBD - melding/reorganization of several patches
---
 fs/nfs/nfs4proc.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 9e6b086..c7c7277 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -286,6 +286,8 @@ static int nfs4_handle_exception(const struct nfs_server *server, int errorcode,
 		case -NFS4ERR_GRACE:
 		case -NFS4ERR_DELAY:
 		case -EKEYEXPIRED:
+		case -NFS4ERR_LAYOUTTRYLATER:
+		case -NFS4ERR_RECALLCONFLICT:
 			ret = nfs4_delay(server->client, &exception->timeout);
 			if (ret != 0)
 				break;
-- 
1.7.2.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 12/13] RFC: pnfs: add LAYOUTGET and GETDEVICEINFO infrastructure
  2010-09-02 18:00 [PATCH 00/13] RFC: pnfs: LAYOUTGET/DEVINFO submission Fred Isaman
                   ` (10 preceding siblings ...)
  2010-09-02 18:00 ` [PATCH 11/13] RFC: nfs: retry on certain pnfs errors Fred Isaman
@ 2010-09-02 18:00 ` Fred Isaman
  2010-09-10 20:11   ` Trond Myklebust
  2010-09-02 18:00 ` [PATCH 13/13] RFC: pnfs: filelayout: add driver's " Fred Isaman
  12 siblings, 1 reply; 55+ messages in thread
From: Fred Isaman @ 2010-09-02 18:00 UTC (permalink / raw)
  To: linux-nfs

From: The pNFS Team <linux-nfs@vger.kernel.org>

Add the ability to actually send LAYOUTGET and GETDEVICEINFO.  This also adds
in the machinery to handle layout state and the deviceid cache.  Note that
GETDEVICEINFO is not called directly by the generic layer.  Instead it
is called by the drivers while parsing the LAYOUTGET opaque data in response
to an unknown device id embedded therein.  Annoyingly, RFC 5661 only encodes
device ids within the driver-specific opaque data.
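
To make the LAYOUTGET wire format concrete, here is a standalone sketch
of the argument encoding.  It does not use the kernel's xdr_stream; the
struct and the put_u32/put_u64 helpers are hypothetical, and the field
order follows the LAYOUTGET4args definition in RFC 5661.  It shows where
the "44 + NFS4_STATEID_SIZE" reservation in encode_layoutget() comes
from: seven 32-bit or 64-bit fields plus the opcode add up to 44 bytes,
and the stateid is a fixed-length opaque with no length word.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_STATEID_SIZE 16	/* matches NFS4_STATEID_SIZE in the series */
#define TOY_OP_LAYOUTGET 50	/* OP_LAYOUTGET per RFC 5661 */

/* Hypothetical flattened view of the LAYOUTGET arguments. */
struct toy_layoutget_args {
	uint32_t type;		/* layout type, LAYOUT4_NFSV4_1_FILES is 1 */
	uint32_t iomode;	/* LAYOUTIOMODE4_READ is 1, _RW is 2 */
	uint64_t offset;
	uint64_t length;
	uint64_t minlength;
	uint8_t stateid[TOY_STATEID_SIZE];
	uint32_t maxcount;
};

/* XDR integers are big-endian on the wire. */
static uint8_t *put_u32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
	return p + 4;
}

static uint8_t *put_u64(uint8_t *p, uint64_t v)
{
	p = put_u32(p, (uint32_t)(v >> 32));
	return put_u32(p, (uint32_t)v);
}

/* Encode into buf (at least 60 bytes); returns the number of bytes written. */
static size_t encode_layoutget(uint8_t *buf, const struct toy_layoutget_args *a)
{
	uint8_t *p = buf;

	p = put_u32(p, TOY_OP_LAYOUTGET);
	p = put_u32(p, 0);		/* loga_signal_layout_avail = FALSE */
	p = put_u32(p, a->type);
	p = put_u32(p, a->iomode);
	p = put_u64(p, a->offset);
	p = put_u64(p, a->length);
	p = put_u64(p, a->minlength);
	memcpy(p, a->stateid, TOY_STATEID_SIZE);	/* fixed opaque */
	p += TOY_STATEID_SIZE;
	p = put_u32(p, a->maxcount);
	return (size_t)(p - buf);
}
```

For a whole-file request, offset would be 0 and length all ones, and the
function returns 60 regardless of the values encoded.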

Signed-off-by: TBD - melding/reorganization of several patches
---
 fs/nfs/nfs4proc.c         |  134 ++++++++++++++++
 fs/nfs/nfs4xdr.c          |  302 +++++++++++++++++++++++++++++++++++
 fs/nfs/pnfs.c             |  382 ++++++++++++++++++++++++++++++++++++++++++---
 fs/nfs/pnfs.h             |   91 +++++++++++-
 include/linux/nfs4.h      |    2 +
 include/linux/nfs_fs_sb.h |    1 +
 include/linux/nfs_xdr.h   |   49 ++++++
 7 files changed, 935 insertions(+), 26 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index c7c7277..7eeea0e 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -55,6 +55,7 @@
 #include "internal.h"
 #include "iostat.h"
 #include "callback.h"
+#include "pnfs.h"
 
 #define NFSDBG_FACILITY		NFSDBG_PROC
 
@@ -5335,6 +5336,139 @@ out:
 	dprintk("<-- %s status=%d\n", __func__, status);
 	return status;
 }
+
+static void
+nfs4_layoutget_prepare(struct rpc_task *task, void *calldata)
+{
+	struct nfs4_layoutget *lgp = calldata;
+	struct inode *ino = lgp->args.inode;
+	struct nfs_server *server = NFS_SERVER(ino);
+
+	dprintk("--> %s\n", __func__);
+	if (nfs4_setup_sequence(server, &lgp->args.seq_args,
+				&lgp->res.seq_res, 0, task))
+		return;
+	rpc_call_start(task);
+}
+
+static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
+{
+	struct nfs4_layoutget *lgp = calldata;
+	struct inode *ino = lgp->args.inode;
+	struct nfs_server *server = NFS_SERVER(ino);
+
+	dprintk("--> %s\n", __func__);
+
+	if (!nfs4_sequence_done(task, &lgp->res.seq_res))
+		return;
+
+	if (RPC_ASSASSINATED(task))
+		return;
+
+	if (nfs4_async_handle_error(task, server, NULL) == -EAGAIN)
+		nfs_restart_rpc(task, server->nfs_client);
+
+	lgp->status = task->tk_status;
+	dprintk("<-- %s\n", __func__);
+}
+
+static void nfs4_layoutget_release(void *calldata)
+{
+	struct nfs4_layoutget *lgp = calldata;
+
+	dprintk("--> %s\n", __func__);
+	put_layout_hdr(lgp->args.inode);
+	if (lgp->res.layout.buf != NULL)
+		free_page((unsigned long) lgp->res.layout.buf);
+	put_nfs_open_context(lgp->args.ctx);
+	kfree(calldata);
+	dprintk("<-- %s\n", __func__);
+}
+
+static const struct rpc_call_ops nfs4_layoutget_call_ops = {
+	.rpc_call_prepare = nfs4_layoutget_prepare,
+	.rpc_call_done = nfs4_layoutget_done,
+	.rpc_release = nfs4_layoutget_release,
+};
+
+static int _nfs4_proc_layoutget(struct nfs4_layoutget *lgp)
+{
+	struct nfs_server *server = NFS_SERVER(lgp->args.inode);
+	struct rpc_task *task;
+	struct rpc_message msg = {
+		.rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_LAYOUTGET],
+		.rpc_argp = &lgp->args,
+		.rpc_resp = &lgp->res,
+	};
+	struct rpc_task_setup task_setup_data = {
+		.rpc_client = server->client,
+		.rpc_message = &msg,
+		.callback_ops = &nfs4_layoutget_call_ops,
+		.callback_data = lgp,
+		.flags = RPC_TASK_ASYNC,
+	};
+	int status = 0;
+
+	dprintk("--> %s\n", __func__);
+
+	lgp->res.layout.buf = (void *)__get_free_page(GFP_NOFS);
+	if (lgp->res.layout.buf == NULL) {
+		nfs4_layoutget_release(lgp);
+		return -ENOMEM;
+	}
+
+	lgp->res.seq_res.sr_slotid = NFS4_MAX_SLOT_TABLE;
+	task = rpc_run_task(&task_setup_data);
+	if (IS_ERR(task))
+		return PTR_ERR(task);
+	status = nfs4_wait_for_completion_rpc_task(task);
+	if (status != 0)
+		goto out;
+	status = lgp->status;
+	if (status != 0)
+		goto out;
+	status = pnfs_layout_process(lgp);
+out:
+	rpc_put_task(task);
+	dprintk("<-- %s status=%d\n", __func__, status);
+	return status;
+}
+
+int nfs4_proc_layoutget(struct nfs4_layoutget *lgp)
+{
+	struct nfs_server *server = NFS_SERVER(lgp->args.inode);
+	struct nfs4_exception exception = { };
+	int err;
+	do {
+		err = nfs4_handle_exception(server, _nfs4_proc_layoutget(lgp),
+					    &exception);
+	} while (exception.retry);
+	return err;
+}
+
+int nfs4_proc_getdeviceinfo(struct nfs_server *server, struct pnfs_device *pdev)
+{
+	struct nfs4_getdeviceinfo_args args = {
+		.pdev = pdev,
+	};
+	struct nfs4_getdeviceinfo_res res = {
+		.pdev = pdev,
+	};
+	struct rpc_message msg = {
+		.rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_GETDEVICEINFO],
+		.rpc_argp = &args,
+		.rpc_resp = &res,
+	};
+	int status;
+
+	dprintk("--> %s\n", __func__);
+	status = nfs4_call_sync(server, &msg, &args, &res, 0);
+	dprintk("<-- %s status=%d\n", __func__, status);
+
+	return status;
+}
+EXPORT_SYMBOL_GPL(nfs4_proc_getdeviceinfo);
+
 #endif /* CONFIG_NFS_V4_1 */
 
 struct nfs4_state_recovery_ops nfs40_reboot_recovery_ops = {
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index 60233ae..aaf6fe5 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -52,6 +52,7 @@
 #include <linux/nfs_idmap.h>
 #include "nfs4_fs.h"
 #include "internal.h"
+#include "pnfs.h"
 
 #define NFSDBG_FACILITY		NFSDBG_XDR
 
@@ -310,6 +311,19 @@ static int nfs4_stat_to_errno(int);
 				XDR_QUADLEN(NFS4_MAX_SESSIONID_LEN) + 5)
 #define encode_reclaim_complete_maxsz	(op_encode_hdr_maxsz + 4)
 #define decode_reclaim_complete_maxsz	(op_decode_hdr_maxsz + 4)
+#define encode_getdeviceinfo_maxsz (op_encode_hdr_maxsz + 4 + \
+				XDR_QUADLEN(NFS4_PNFS_DEVICEID4_SIZE))
+#define decode_getdeviceinfo_maxsz (op_decode_hdr_maxsz + \
+				1 /* layout type */ + \
+				1 /* opaque devaddr4 length */ + \
+				  /* devaddr4 payload is read into page */ \
+				1 /* notification bitmap length */ + \
+				1 /* notification bitmap */)
+#define encode_layoutget_maxsz	(op_encode_hdr_maxsz + 10 + \
+				encode_stateid_maxsz)
+#define decode_layoutget_maxsz	(op_decode_hdr_maxsz + 8 + \
+				decode_stateid_maxsz + \
+				XDR_QUADLEN(PNFS_LAYOUT_MAXSIZE))
 #else /* CONFIG_NFS_V4_1 */
 #define encode_sequence_maxsz	0
 #define decode_sequence_maxsz	0
@@ -699,6 +713,20 @@ static int nfs4_stat_to_errno(int);
 #define NFS4_dec_reclaim_complete_sz	(compound_decode_hdr_maxsz + \
 					 decode_sequence_maxsz + \
 					 decode_reclaim_complete_maxsz)
+#define NFS4_enc_getdeviceinfo_sz (compound_encode_hdr_maxsz +    \
+				encode_sequence_maxsz +\
+				encode_getdeviceinfo_maxsz)
+#define NFS4_dec_getdeviceinfo_sz (compound_decode_hdr_maxsz +    \
+				decode_sequence_maxsz + \
+				decode_getdeviceinfo_maxsz)
+#define NFS4_enc_layoutget_sz	(compound_encode_hdr_maxsz + \
+				encode_sequence_maxsz + \
+				encode_putfh_maxsz +        \
+				encode_layoutget_maxsz)
+#define NFS4_dec_layoutget_sz	(compound_decode_hdr_maxsz + \
+				decode_sequence_maxsz + \
+				decode_putfh_maxsz +        \
+				decode_layoutget_maxsz)
 
 const u32 nfs41_maxwrite_overhead = ((RPC_MAX_HEADER_WITH_AUTH +
 				      compound_encode_hdr_maxsz +
@@ -1726,6 +1754,61 @@ static void encode_sequence(struct xdr_stream *xdr,
 #endif /* CONFIG_NFS_V4_1 */
 }
 
+#ifdef CONFIG_NFS_V4_1
+static void
+encode_getdeviceinfo(struct xdr_stream *xdr,
+		     const struct nfs4_getdeviceinfo_args *args,
+		     struct compound_hdr *hdr)
+{
+	int has_bitmap = (args->pdev->dev_notify_types != 0);
+	int len = 16 + NFS4_PNFS_DEVICEID4_SIZE + (has_bitmap * 4);
+	__be32 *p;
+
+	p = reserve_space(xdr, len);
+	*p++ = cpu_to_be32(OP_GETDEVICEINFO);
+	p = xdr_encode_opaque_fixed(p, args->pdev->dev_id.data,
+				    NFS4_PNFS_DEVICEID4_SIZE);
+	*p++ = cpu_to_be32(args->pdev->layout_type);
+	*p++ = cpu_to_be32(args->pdev->pglen);		/* gdia_maxcount */
+	*p++ = cpu_to_be32(has_bitmap);			/* bitmap length [01] */
+	if (has_bitmap)
+		*p = cpu_to_be32(args->pdev->dev_notify_types);
+	hdr->nops++;
+	hdr->replen += decode_getdeviceinfo_maxsz;
+}
+
+static void
+encode_layoutget(struct xdr_stream *xdr,
+		      const struct nfs4_layoutget_args *args,
+		      struct compound_hdr *hdr)
+{
+	nfs4_stateid stateid;
+	__be32 *p;
+
+	p = reserve_space(xdr, 44 + NFS4_STATEID_SIZE);
+	*p++ = cpu_to_be32(OP_LAYOUTGET);
+	*p++ = cpu_to_be32(0);     /* Signal layout available */
+	*p++ = cpu_to_be32(args->type);
+	*p++ = cpu_to_be32(args->range.iomode);
+	p = xdr_encode_hyper(p, args->range.offset);
+	p = xdr_encode_hyper(p, args->range.length);
+	p = xdr_encode_hyper(p, args->minlength);
+	pnfs_get_layout_stateid(&stateid, NFS_I(args->inode)->layout);
+	p = xdr_encode_opaque_fixed(p, &stateid.data, NFS4_STATEID_SIZE);
+	*p = cpu_to_be32(args->maxcount);
+
+	dprintk("%s: 1st type:0x%x iomode:%d off:%lu len:%lu mc:%d\n",
+		__func__,
+		args->type,
+		args->range.iomode,
+		(unsigned long)args->range.offset,
+		(unsigned long)args->range.length,
+		args->maxcount);
+	hdr->nops++;
+	hdr->replen += decode_layoutget_maxsz;
+}
+#endif /* CONFIG_NFS_V4_1 */
+
 /*
  * END OF "GENERIC" ENCODE ROUTINES.
  */
@@ -2543,6 +2626,51 @@ static int nfs4_xdr_enc_reclaim_complete(struct rpc_rqst *req, uint32_t *p,
 	return 0;
 }
 
+/*
+ * Encode GETDEVICEINFO request
+ */
+static int nfs4_xdr_enc_getdeviceinfo(struct rpc_rqst *req, uint32_t *p,
+				      struct nfs4_getdeviceinfo_args *args)
+{
+	struct xdr_stream xdr;
+	struct compound_hdr hdr = {
+		.minorversion = nfs4_xdr_minorversion(&args->seq_args),
+	};
+
+	xdr_init_encode(&xdr, &req->rq_snd_buf, p);
+	encode_compound_hdr(&xdr, req, &hdr);
+	encode_sequence(&xdr, &args->seq_args, &hdr);
+	encode_getdeviceinfo(&xdr, args, &hdr);
+
+	/* set up reply kvec. Subtract notification bitmap max size (2)
+	 * so that notification bitmap is put in xdr_buf tail */
+	xdr_inline_pages(&req->rq_rcv_buf, (hdr.replen - 2) << 2,
+			 args->pdev->pages, args->pdev->pgbase,
+			 args->pdev->pglen);
+
+	encode_nops(&hdr);
+	return 0;
+}
+
+/*
+ *  Encode LAYOUTGET request
+ */
+static int nfs4_xdr_enc_layoutget(struct rpc_rqst *req, uint32_t *p,
+				  struct nfs4_layoutget_args *args)
+{
+	struct xdr_stream xdr;
+	struct compound_hdr hdr = {
+		.minorversion = nfs4_xdr_minorversion(&args->seq_args),
+	};
+
+	xdr_init_encode(&xdr, &req->rq_snd_buf, p);
+	encode_compound_hdr(&xdr, req, &hdr);
+	encode_sequence(&xdr, &args->seq_args, &hdr);
+	encode_putfh(&xdr, NFS_FH(args->inode), &hdr);
+	encode_layoutget(&xdr, args, &hdr);
+	encode_nops(&hdr);
+	return 0;
+}
 #endif /* CONFIG_NFS_V4_1 */
 
 static void print_overflow_msg(const char *func, const struct xdr_stream *xdr)
@@ -4788,6 +4916,131 @@ out_overflow:
 #endif /* CONFIG_NFS_V4_1 */
 }
 
+#if defined(CONFIG_NFS_V4_1)
+
+static int decode_getdeviceinfo(struct xdr_stream *xdr,
+				struct pnfs_device *pdev)
+{
+	__be32 *p;
+	uint32_t len, type;
+	int status;
+
+	status = decode_op_hdr(xdr, OP_GETDEVICEINFO);
+	if (status) {
+		if (status == -ETOOSMALL) {
+			p = xdr_inline_decode(xdr, 4);
+			if (unlikely(!p))
+				goto out_overflow;
+			pdev->mincount = be32_to_cpup(p);
+			dprintk("%s: Min count too small. mincnt = %u\n",
+				__func__, pdev->mincount);
+		}
+		return status;
+	}
+
+	p = xdr_inline_decode(xdr, 8);
+	if (unlikely(!p))
+		goto out_overflow;
+	type = be32_to_cpup(p++);
+	if (type != pdev->layout_type) {
+		dprintk("%s: layout mismatch req: %u reply: %u\n",
+			__func__, pdev->layout_type, type);
+		return -EINVAL;
+	}
+	/*
+	 * Get the length of the opaque device_addr4. xdr_read_pages places
+	 * the opaque device_addr4 in the xdr_buf->pages (pnfs_device->pages)
+	 * and places the remaining xdr data in xdr_buf->tail
+	 */
+	pdev->mincount = be32_to_cpup(p);
+	xdr_read_pages(xdr, pdev->mincount); /* include space for the length */
+
+	/*
+	 * At most one bitmap word. If the server returns a bitmap of more
+	 * than one word we ignore the extra invalid words given that
+	 * getdeviceinfo is the final operation in the compound.
+	 */
+	p = xdr_inline_decode(xdr, 4);
+	if (unlikely(!p))
+		goto out_overflow;
+	len = be32_to_cpup(p);
+	if (len) {
+		p = xdr_inline_decode(xdr, 4);
+		if (unlikely(!p))
+			goto out_overflow;
+		pdev->dev_notify_types = be32_to_cpup(p);
+	} else
+		pdev->dev_notify_types = 0;
+	return 0;
+out_overflow:
+	print_overflow_msg(__func__, xdr);
+	return -EIO;
+}
+
+static int decode_layoutget(struct xdr_stream *xdr, struct rpc_rqst *req,
+			    struct nfs4_layoutget_res *res)
+{
+	__be32 *p;
+	int status;
+	u32 layout_count;
+
+	status = decode_op_hdr(xdr, OP_LAYOUTGET);
+	if (status)
+		return status;
+	p = xdr_inline_decode(xdr, 8 + NFS4_STATEID_SIZE);
+	if (unlikely(!p))
+		goto out_overflow;
+	res->return_on_close = be32_to_cpup(p++);
+	p = xdr_decode_opaque_fixed(p, res->stateid.data, NFS4_STATEID_SIZE);
+	layout_count = be32_to_cpup(p);
+	if (!layout_count) {
+		dprintk("%s: server responded with empty layout array\n",
+			__func__);
+		return -EINVAL;
+	}
+
+	p = xdr_inline_decode(xdr, 24);
+	if (unlikely(!p))
+		goto out_overflow;
+	p = xdr_decode_hyper(p, &res->range.offset);
+	p = xdr_decode_hyper(p, &res->range.length);
+	res->range.iomode = be32_to_cpup(p++);
+	res->type = be32_to_cpup(p++);
+
+	status = decode_opaque_inline(xdr, &res->layout.len, (char **)&p);
+	if (unlikely(status))
+		return status;
+
+	dprintk("%s roff:%llu rlen:%llu riomode:%d, lo_type:0x%x, lo.len:%d\n",
+		__func__,
+		(unsigned long long)res->range.offset,
+		(unsigned long long)res->range.length,
+		res->range.iomode,
+		res->type,
+		res->layout.len);
+
+	/* nfs4_proc_layoutget allocated a single page */
+	if (res->layout.len > PAGE_SIZE)
+		return -ENOMEM;
+	memcpy(res->layout.buf, p, res->layout.len);
+
+	if (layout_count > 1) {
+		/* We only handle a length one array at the moment.  Any
+		 * further entries are just ignored.  Note that this means
+		 * the client may see a response that is less than the
+		 * minimum it requested.
+		 */
+		dprintk("%s: server responded with %d layouts, dropping tail\n",
+			__func__, layout_count);
+	}
+
+	return 0;
+out_overflow:
+	print_overflow_msg(__func__, xdr);
+	return -EIO;
+}
+#endif /* CONFIG_NFS_V4_1 */
+
 /*
  * END OF "GENERIC" DECODE ROUTINES.
  */
@@ -5815,6 +6068,53 @@ static int nfs4_xdr_dec_reclaim_complete(struct rpc_rqst *rqstp, uint32_t *p,
 		status = decode_reclaim_complete(&xdr, (void *)NULL);
 	return status;
 }
+
+/*
+ * Decode GETDEVICEINFO response
+ */
+static int nfs4_xdr_dec_getdeviceinfo(struct rpc_rqst *rqstp, uint32_t *p,
+				      struct nfs4_getdeviceinfo_res *res)
+{
+	struct xdr_stream xdr;
+	struct compound_hdr hdr;
+	int status;
+
+	xdr_init_decode(&xdr, &rqstp->rq_rcv_buf, p);
+	status = decode_compound_hdr(&xdr, &hdr);
+	if (status != 0)
+		goto out;
+	status = decode_sequence(&xdr, &res->seq_res, rqstp);
+	if (status != 0)
+		goto out;
+	status = decode_getdeviceinfo(&xdr, res->pdev);
+out:
+	return status;
+}
+
+/*
+ * Decode LAYOUTGET response
+ */
+static int nfs4_xdr_dec_layoutget(struct rpc_rqst *rqstp, uint32_t *p,
+				  struct nfs4_layoutget_res *res)
+{
+	struct xdr_stream xdr;
+	struct compound_hdr hdr;
+	int status;
+
+	xdr_init_decode(&xdr, &rqstp->rq_rcv_buf, p);
+	status = decode_compound_hdr(&xdr, &hdr);
+	if (status)
+		goto out;
+	status = decode_sequence(&xdr, &res->seq_res, rqstp);
+	if (status)
+		goto out;
+	status = decode_putfh(&xdr);
+	if (status)
+		goto out;
+	status = decode_layoutget(&xdr, rqstp, res);
+out:
+	return status;
+}
 #endif /* CONFIG_NFS_V4_1 */
 
 __be32 *nfs4_decode_dirent(__be32 *p, struct nfs_entry *entry, int plus)
@@ -5993,6 +6293,8 @@ struct rpc_procinfo	nfs4_procedures[] = {
   PROC(SEQUENCE,	enc_sequence,	dec_sequence),
   PROC(GET_LEASE_TIME,	enc_get_lease_time,	dec_get_lease_time),
   PROC(RECLAIM_COMPLETE, enc_reclaim_complete,  dec_reclaim_complete),
+  PROC(GETDEVICEINFO, enc_getdeviceinfo, dec_getdeviceinfo),
+  PROC(LAYOUTGET,  enc_layoutget,     dec_layoutget),
 #endif /* CONFIG_NFS_V4_1 */
 };
 
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index cbce942..faf6c4c 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -128,6 +128,12 @@ pnfs_register_layoutdriver(struct pnfs_layoutdriver_type *ld_type)
 		return status;
 	}
 
+	if (!io_ops->alloc_lseg || !io_ops->free_lseg) {
+		printk(KERN_ERR "%s Layout driver must provide "
+		       "alloc_lseg and free_lseg.\n", __func__);
+		return status;
+	}
+
 	spin_lock(&pnfs_spinlock);
 	if (!find_pnfs_driver_locked(ld_type->id)) {
 		list_add(&ld_type->pnfs_tblid, &pnfs_modules_tbl);
@@ -153,6 +159,10 @@ pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *ld_type)
 }
 EXPORT_SYMBOL(pnfs_unregister_layoutdriver);
 
+/*
+ * pNFS client layout cache
+ */
+
 static void
 get_layout_hdr_locked(struct pnfs_layout_hdr *lo)
 {
@@ -175,6 +185,15 @@ put_layout_hdr_locked(struct pnfs_layout_hdr *lo)
 	}
 }
 
+void
+put_layout_hdr(struct inode *inode)
+{
+	spin_lock(&inode->i_lock);
+	put_layout_hdr_locked(NFS_I(inode)->layout);
+	spin_unlock(&inode->i_lock);
+
+}
+
 static void
 init_lseg(struct pnfs_layout_hdr *lo, struct pnfs_layout_segment *lseg)
 {
@@ -191,7 +210,7 @@ destroy_lseg(struct kref *kref)
 	struct pnfs_layout_hdr *local = lseg->layout;
 
 	dprintk("--> %s\n", __func__);
-	kfree(lseg);
+	PNFS_LD_IO_OPS(local)->free_lseg(lseg);
 	/* Matched by get_layout_hdr_locked in pnfs_insert_layout */
 	put_layout_hdr_locked(local);
 }
@@ -226,6 +245,7 @@ pnfs_clear_lseg_list(struct pnfs_layout_hdr *lo)
 	/* List does not take a reference, so no need for put here */
 	list_del_init(&lo->layouts);
 	spin_unlock(&clp->cl_lock);
+	pnfs_set_layout_stateid(lo, &zero_stateid);
 
 	dprintk("%s:Return\n", __func__);
 }
@@ -268,40 +288,120 @@ pnfs_destroy_all_layouts(struct nfs_client *clp)
 	}
 }
 
-static void pnfs_insert_layout(struct pnfs_layout_hdr *lo,
-			       struct pnfs_layout_segment *lseg);
+void
+pnfs_set_layout_stateid(struct pnfs_layout_hdr *lo,
+			const nfs4_stateid *stateid)
+{
+	write_seqlock(&lo->seqlock);
+	memcpy(lo->stateid.data, stateid->data, sizeof(lo->stateid.data));
+	write_sequnlock(&lo->seqlock);
+}
+
+void
+pnfs_get_layout_stateid(nfs4_stateid *dst, struct pnfs_layout_hdr *lo)
+{
+	int seq;
 
-/* Get layout from server. */
+	dprintk("--> %s\n", __func__);
+
+	do {
+		seq = read_seqbegin(&lo->seqlock);
+		memcpy(dst->data, lo->stateid.data,
+		       sizeof(lo->stateid.data));
+	} while (read_seqretry(&lo->seqlock, seq));
+
+	dprintk("<-- %s\n", __func__);
+}
+
+static void
+pnfs_layout_from_open_stateid(struct pnfs_layout_hdr *lo,
+			      struct nfs4_state *state)
+{
+	int seq;
+
+	dprintk("--> %s\n", __func__);
+
+	write_seqlock(&lo->seqlock);
+	/* Zero stateid, which is illegal to use in layout, is our
+	 * marker for an un-initialized stateid.
+	 */
+	if (!memcmp(lo->stateid.data, &zero_stateid, NFS4_STATEID_SIZE))
+		do {
+			seq = read_seqbegin(&state->seqlock);
+			memcpy(lo->stateid.data, state->stateid.data,
+					sizeof(state->stateid.data));
+		} while (read_seqretry(&state->seqlock, seq));
+	write_sequnlock(&lo->seqlock);
+	dprintk("<-- %s\n", __func__);
+}
+
+/*
+ * Get layout from server.
+ *    For now, assume that whole-file layouts are requested:
+ *    arg->offset: 0
+ *    arg->length: all ones
+ */
 static struct pnfs_layout_segment *
 send_layoutget(struct pnfs_layout_hdr *lo,
 	   struct nfs_open_context *ctx,
 	   u32 iomode)
 {
 	struct inode *ino = lo->inode;
-	struct pnfs_layout_segment *lseg;
+	struct nfs_server *server = NFS_SERVER(ino);
+	struct nfs4_layoutget *lgp;
+	struct pnfs_layout_segment *lseg = NULL;
 
-	/* Lets pretend we sent LAYOUTGET and got a response */
-	lseg = kzalloc(sizeof(*lseg), GFP_KERNEL);
+	dprintk("--> %s\n", __func__);
+
+	BUG_ON(ctx == NULL);
+	lgp = kzalloc(sizeof(*lgp), GFP_KERNEL);
+	if (lgp == NULL) {
+		put_layout_hdr(lo->inode);
+		return NULL;
+	}
+	lgp->args.minlength = NFS4_MAX_UINT64;
+	lgp->args.maxcount = PNFS_LAYOUT_MAXSIZE;
+	lgp->args.range.iomode = iomode;
+	lgp->args.range.offset = 0;
+	lgp->args.range.length = NFS4_MAX_UINT64;
+	lgp->args.type = server->pnfs_curr_ld->id;
+	lgp->args.inode = ino;
+	lgp->args.ctx = get_nfs_open_context(ctx);
+	lgp->lsegpp = &lseg;
+
+	if (!memcmp(lo->stateid.data, &zero_stateid, NFS4_STATEID_SIZE))
+		pnfs_layout_from_open_stateid(NFS_I(ino)->layout, ctx->state);
+
+	/* Synchronously retrieve layout information from server and
+	 * store in lseg.
+	 */
+	nfs4_proc_layoutget(lgp);
 	if (!lseg) {
+		/* remember that LAYOUTGET failed and suspend trying */
 		set_bit(lo_fail_bit(iomode), &lo->state);
-		spin_lock(&ino->i_lock);
-		put_layout_hdr_locked(lo);
-		spin_unlock(&ino->i_lock);
-		return NULL;
 	}
-	init_lseg(lo, lseg);
-	lseg->iomode = IOMODE_RW;
-	spin_lock(&ino->i_lock);
-	pnfs_insert_layout(lo, lseg);
-	put_layout_hdr_locked(lo);
-	spin_unlock(&ino->i_lock);
 	return lseg;
 }
 
+/*
+ * Compare two layout segments for sorting into layout cache.
+ * We want to preferentially return RW over RO layouts, so ensure those
+ * are seen first.
+ */
+static s64
+cmp_layout(u32 iomode1, u32 iomode2)
+{
+	/* read > read/write */
+	return (int)(iomode2 == IOMODE_READ) - (int)(iomode1 == IOMODE_READ);
+}
+
 static void
 pnfs_insert_layout(struct pnfs_layout_hdr *lo,
 		   struct pnfs_layout_segment *lseg)
 {
+	struct pnfs_layout_segment *lp;
+	int found = 0;
+
 	dprintk("%s:Begin\n", __func__);
 
 	assert_spin_locked(&lo->inode->i_lock);
@@ -313,13 +413,28 @@ pnfs_insert_layout(struct pnfs_layout_hdr *lo,
 		list_add_tail(&lo->layouts, &clp->cl_layouts);
 		spin_unlock(&clp->cl_lock);
 	}
-	/* STUB - add the constructed lseg if necessary */
-	if (list_empty(&lo->segs)) {
+	list_for_each_entry(lp, &lo->segs, fi_list) {
+		if (cmp_layout(lp->range.iomode, lseg->range.iomode) > 0)
+			continue;
+		list_add_tail(&lseg->fi_list, &lp->fi_list);
+		dprintk("%s: inserted lseg %p "
+			"iomode %d offset %llu length %llu before "
+			"lp %p iomode %d offset %llu length %llu\n",
+			__func__, lseg, lseg->range.iomode,
+			lseg->range.offset, lseg->range.length,
+			lp, lp->range.iomode, lp->range.offset,
+			lp->range.length);
+		found = 1;
+		break;
+	}
+	if (!found) {
 		list_add_tail(&lseg->fi_list, &lo->segs);
-		get_layout_hdr_locked(lo);
-		dprintk("%s: inserted lseg %p iomode %d at tail\n",
-			__func__, lseg, lseg->iomode);
+		dprintk("%s: inserted lseg %p "
+			"iomode %d offset %llu length %llu at tail\n",
+			__func__, lseg, lseg->range.iomode,
+			lseg->range.offset, lseg->range.length);
 	}
+	get_layout_hdr_locked(lo);
 
 	dprintk("%s:Return\n", __func__);
 }
@@ -335,6 +450,7 @@ alloc_init_layout_hdr(struct inode *ino)
 	lo->refcount = 1;
 	INIT_LIST_HEAD(&lo->layouts);
 	INIT_LIST_HEAD(&lo->segs);
+	seqlock_init(&lo->seqlock);
 	lo->inode = ino;
 	return lo;
 }
@@ -362,11 +478,46 @@ pnfs_find_alloc_layout(struct inode *ino)
 	return nfsi->layout;
 }
 
-/* STUB - LAYOUTGET never succeeds, so cache is empty */
+/*
+ * iomode matching rules:
+ * iomode	lseg	match
+ * -----	-----	-----
+ * ANY		READ	true
+ * ANY		RW	true
+ * RW		READ	false
+ * RW		RW	true
+ * READ		READ	true
+ * READ		RW	true
+ */
+static int
+is_matching_lseg(struct pnfs_layout_segment *lseg, u32 iomode)
+{
+	return (iomode != IOMODE_RW || lseg->range.iomode == IOMODE_RW);
+}
+
+/*
+ * lookup range in layout
+ */
 static struct pnfs_layout_segment *
 pnfs_has_layout(struct pnfs_layout_hdr *lo, u32 iomode)
 {
-	return NULL;
+	struct pnfs_layout_segment *lseg, *ret = NULL;
+
+	dprintk("%s:Begin\n", __func__);
+
+	assert_spin_locked(&lo->inode->i_lock);
+	list_for_each_entry(lseg, &lo->segs, fi_list) {
+		if (is_matching_lseg(lseg, iomode)) {
+			ret = lseg;
+			break;
+		}
+		if (cmp_layout(iomode, lseg->range.iomode) > 0)
+			break;
+	}
+
+	dprintk("%s:Return lseg %p ref %d\n",
+		__func__, ret, ret ? atomic_read(&ret->kref.refcount) : 0);
+	return ret;
 }
 
 /*
@@ -403,7 +554,7 @@ pnfs_update_layout(struct inode *ino,
 	if (test_bit(lo_fail_bit(iomode), &nfsi->layout->state))
 		goto out_unlock;
 
-	get_layout_hdr_locked(lo);
+	get_layout_hdr_locked(lo); /* Matched in nfs4_layoutget_release */
 	spin_unlock(&ino->i_lock);
 
 	lseg = send_layoutget(lo, ctx, iomode);
@@ -415,3 +566,184 @@ out_unlock:
 	spin_unlock(&ino->i_lock);
 	goto out;
 }
+
+int
+pnfs_layout_process(struct nfs4_layoutget *lgp)
+{
+	struct pnfs_layout_hdr *lo = NFS_I(lgp->args.inode)->layout;
+	struct nfs4_layoutget_res *res = &lgp->res;
+	struct pnfs_layout_segment *lseg;
+	struct inode *ino = lo->inode;
+	int status = 0;
+
+	/* Inject layout blob into I/O device driver */
+	lseg = PNFS_LD_IO_OPS(lo)->alloc_lseg(lo, res);
+	if (!lseg || IS_ERR(lseg)) {
+		if (!lseg)
+			status = -ENOMEM;
+		else
+			status = PTR_ERR(lseg);
+		dprintk("%s: Could not allocate layout: error %d\n",
+		       __func__, status);
+		goto out;
+	}
+
+	spin_lock(&ino->i_lock);
+	init_lseg(lo, lseg);
+	lseg->range = res->range;
+	*lgp->lsegpp = lseg;
+	pnfs_insert_layout(lo, lseg);
+
+	/* Done processing layoutget. Set the layout stateid */
+	pnfs_set_layout_stateid(lo, &res->stateid);
+	spin_unlock(&ino->i_lock);
+out:
+	return status;
+}
+
+/*
+ * Device ID cache. Currently supports one layout type per struct nfs_client.
+ * Add layout type to the lookup key to expand to support multiple types.
+ */
+int
+nfs4_alloc_init_deviceid_cache(struct nfs_client *clp,
+			 void (*free_callback)(struct nfs4_deviceid *))
+{
+	struct nfs4_deviceid_cache *c;
+
+	c = kzalloc(sizeof(struct nfs4_deviceid_cache), GFP_KERNEL);
+	if (!c)
+		return -ENOMEM;
+	spin_lock(&clp->cl_lock);
+	if (clp->cl_devid_cache != NULL) {
+		atomic_inc(&clp->cl_devid_cache->dc_ref);
+		dprintk("%s [kref [%d]]\n", __func__,
+			atomic_read(&clp->cl_devid_cache->dc_ref));
+		kfree(c);
+	} else {
+		/* kzalloc initializes hlists */
+		spin_lock_init(&c->dc_lock);
+		atomic_set(&c->dc_ref, 1);
+		c->dc_free_callback = free_callback;
+		clp->cl_devid_cache = c;
+		dprintk("%s [new]\n", __func__);
+	}
+	spin_unlock(&clp->cl_lock);
+	return 0;
+}
+EXPORT_SYMBOL(nfs4_alloc_init_deviceid_cache);
+
+void
+nfs4_init_deviceid_node(struct nfs4_deviceid *d)
+{
+	INIT_HLIST_NODE(&d->de_node);
+	atomic_set(&d->de_ref, 1);
+}
+EXPORT_SYMBOL(nfs4_init_deviceid_node);
+
+/* Called from layoutdriver_io_operations->alloc_lseg */
+void
+nfs4_set_layout_deviceid(struct pnfs_layout_segment *l, struct nfs4_deviceid *d)
+{
+	dprintk("%s [%d]\n", __func__, atomic_read(&d->de_ref));
+	l->deviceid = d;
+}
+EXPORT_SYMBOL(nfs4_set_layout_deviceid);
+
+/*
+ * Called from layoutdriver_io_operations->free_lseg
+ * last layout segment reference frees deviceid
+ */
+void
+nfs4_put_layout_deviceid(struct pnfs_layout_segment *l)
+{
+	struct nfs4_deviceid_cache *c =
+		NFS_SERVER(l->layout->inode)->nfs_client->cl_devid_cache;
+	struct pnfs_deviceid *id = &l->deviceid->de_id;
+	struct nfs4_deviceid *d;
+	struct hlist_node *n;
+	long h = nfs4_deviceid_hash(id);
+
+	dprintk("%s [%d]\n", __func__, atomic_read(&l->deviceid->de_ref));
+	if (!atomic_dec_and_lock(&l->deviceid->de_ref, &c->dc_lock))
+		return;
+
+	hlist_for_each_entry_rcu(d, n, &c->dc_deviceids[h], de_node)
+		if (!memcmp(&d->de_id, id, sizeof(*id))) {
+			hlist_del_rcu(&d->de_node);
+			spin_unlock(&c->dc_lock);
+			synchronize_rcu();
+			c->dc_free_callback(l->deviceid);
+			return;
+		}
+	spin_unlock(&c->dc_lock);
+}
+EXPORT_SYMBOL(nfs4_put_layout_deviceid);
+
+/* Find and reference a deviceid */
+struct nfs4_deviceid *
+nfs4_find_get_deviceid(struct nfs4_deviceid_cache *c, struct pnfs_deviceid *id)
+{
+	struct nfs4_deviceid *d;
+	struct hlist_node *n;
+	long hash = nfs4_deviceid_hash(id);
+
+	dprintk("--> %s hash %ld\n", __func__, hash);
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(d, n, &c->dc_deviceids[hash], de_node) {
+		if (!memcmp(&d->de_id, id, sizeof(*id))) {
+			if (!atomic_inc_not_zero(&d->de_ref)) {
+				goto fail;
+			} else {
+				rcu_read_unlock();
+				return d;
+			}
+		}
+	}
+fail:
+	rcu_read_unlock();
+	return NULL;
+}
+EXPORT_SYMBOL(nfs4_find_get_deviceid);
+
+/*
+ * Add a deviceid to the cache.
+ * GETDEVICEINFOs for same deviceid can race. If deviceid is found, discard new
+ */
+struct nfs4_deviceid *
+nfs4_add_deviceid(struct nfs4_deviceid_cache *c, struct nfs4_deviceid *new)
+{
+	struct nfs4_deviceid *d;
+	struct hlist_node *n;
+	long hash = nfs4_deviceid_hash(&new->de_id);
+
+	dprintk("--> %s hash %ld\n", __func__, hash);
+	spin_lock(&c->dc_lock);
+	hlist_for_each_entry_rcu(d, n, &c->dc_deviceids[hash], de_node) {
+		if (!memcmp(&d->de_id, &new->de_id, sizeof(new->de_id))) {
+			spin_unlock(&c->dc_lock);
+			dprintk("%s [discard]\n", __func__);
+			c->dc_free_callback(new);
+			return d;
+		}
+	}
+	hlist_add_head_rcu(&new->de_node, &c->dc_deviceids[hash]);
+	spin_unlock(&c->dc_lock);
+	dprintk("%s [new]\n", __func__);
+	return new;
+}
+EXPORT_SYMBOL(nfs4_add_deviceid);
+
+void
+nfs4_put_deviceid_cache(struct nfs_client *clp)
+{
+	struct nfs4_deviceid_cache *local = clp->cl_devid_cache;
+
+	dprintk("--> %s cl_devid_cache %p\n", __func__, clp->cl_devid_cache);
+	if (atomic_dec_and_lock(&local->dc_ref, &clp->cl_lock)) {
+		clp->cl_devid_cache = NULL;
+		spin_unlock(&clp->cl_lock);
+		kfree(local);
+	}
+}
+EXPORT_SYMBOL(nfs4_put_deviceid_cache);
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index dac6a72..d343f83 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -12,11 +12,14 @@
 
 struct pnfs_layout_segment {
 	struct list_head fi_list;
-	u32 iomode;
+	struct pnfs_layout_range range;
 	struct kref kref;
 	struct pnfs_layout_hdr *layout;
+	struct nfs4_deviceid *deviceid;
 };
 
+#define NFS4_PNFS_DEVICEID4_SIZE 16
+
 #ifdef CONFIG_NFS_V4_1
 
 #define LAYOUT_NFSV4_1_MODULE_PREFIX "nfs-layouttype4"
@@ -38,17 +41,86 @@ struct pnfs_layout_hdr {
 	int			refcount;
 	struct list_head	layouts;   /* other client layouts */
 	struct list_head	segs;      /* layout segments list */
+	seqlock_t		seqlock;   /* Protects the stateid */
+	nfs4_stateid		stateid;
 	unsigned long		state;
 	struct inode		*inode;
 };
 
 /* Layout driver I/O operations. */
 struct layoutdriver_io_operations {
+	struct pnfs_layout_segment * (*alloc_lseg) (struct pnfs_layout_hdr *layoutid, struct nfs4_layoutget_res *lgr);
+	void (*free_lseg) (struct pnfs_layout_segment *lseg);
+
 	/* Registration information for a new mounted file system */
 	int (*initialize_mountpoint) (struct nfs_client *);
 	int (*uninitialize_mountpoint) (struct nfs_client *);
 };
 
+struct pnfs_deviceid {
+	char data[NFS4_PNFS_DEVICEID4_SIZE];
+};
+
+struct pnfs_device {
+	struct pnfs_deviceid dev_id;
+	unsigned int  layout_type;
+	unsigned int  mincount;
+	struct page **pages;
+	void          *area;
+	unsigned int  pgbase;
+	unsigned int  pglen;
+	unsigned int  dev_notify_types;
+};
+
+/*
+ * Device ID RCU cache. A device ID is unique per client ID and layout type.
+ */
+#define NFS4_DEVICE_ID_HASH_BITS	5
+#define NFS4_DEVICE_ID_HASH_SIZE	(1 << NFS4_DEVICE_ID_HASH_BITS)
+#define NFS4_DEVICE_ID_HASH_MASK	(NFS4_DEVICE_ID_HASH_SIZE - 1)
+
+static inline u32
+nfs4_deviceid_hash(struct pnfs_deviceid *id)
+{
+	unsigned char *cptr = (unsigned char *)id->data;
+	unsigned int nbytes = NFS4_PNFS_DEVICEID4_SIZE;
+	u32 x = 0;
+
+	while (nbytes--) {
+		x *= 37;
+		x += *cptr++;
+	}
+	return x & NFS4_DEVICE_ID_HASH_MASK;
+}
+
+struct nfs4_deviceid_cache {
+	spinlock_t		dc_lock;
+	atomic_t		dc_ref;
+	void			(*dc_free_callback)(struct nfs4_deviceid *);
+	struct hlist_head	dc_deviceids[NFS4_DEVICE_ID_HASH_SIZE];
+	struct hlist_head	dc_to_free;
+};
+
+/* Device ID cache node */
+struct nfs4_deviceid {
+	struct hlist_node	de_node;
+	struct pnfs_deviceid	de_id;
+	atomic_t		de_ref;
+};
+
+extern int nfs4_alloc_init_deviceid_cache(struct nfs_client *,
+				void (*free_callback)(struct nfs4_deviceid *));
+extern void nfs4_put_deviceid_cache(struct nfs_client *);
+extern void nfs4_init_deviceid_node(struct nfs4_deviceid *);
+extern struct nfs4_deviceid *nfs4_find_get_deviceid(
+				struct nfs4_deviceid_cache *,
+				struct pnfs_deviceid *);
+extern struct nfs4_deviceid *nfs4_add_deviceid(struct nfs4_deviceid_cache *,
+				struct nfs4_deviceid *);
+extern void nfs4_set_layout_deviceid(struct pnfs_layout_segment *,
+				struct nfs4_deviceid *);
+extern void nfs4_put_layout_deviceid(struct pnfs_layout_segment *);
+
 extern int pnfs_register_layoutdriver(struct pnfs_layoutdriver_type *);
 extern void pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *);
 
@@ -58,13 +130,30 @@ PNFS_NFS_SERVER(struct pnfs_layout_hdr *lo)
 	return NFS_SERVER(lo->inode);
 }
 
+static inline struct layoutdriver_io_operations *
+PNFS_LD_IO_OPS(struct pnfs_layout_hdr *lo)
+{
+	return PNFS_NFS_SERVER(lo)->pnfs_curr_ld->ld_io_ops;
+}
+
+/* nfs4proc.c */
+extern int nfs4_proc_getdeviceinfo(struct nfs_server *server,
+				   struct pnfs_device *dev);
+extern int nfs4_proc_layoutget(struct nfs4_layoutget *lgp);
+
+/* pnfs.c */
 struct pnfs_layout_segment *
 pnfs_update_layout(struct inode *ino, struct nfs_open_context *ctx,
 		   enum pnfs_iomode access_type);
 void set_pnfs_layoutdriver(struct nfs_server *, u32 id);
 void unset_pnfs_layoutdriver(struct nfs_server *);
+int pnfs_layout_process(struct nfs4_layoutget *lgp);
+void pnfs_set_layout_stateid(struct pnfs_layout_hdr *lo,
+			     const nfs4_stateid *stateid);
 void pnfs_destroy_layout(struct nfs_inode *);
 void pnfs_destroy_all_layouts(struct nfs_client *);
+void put_layout_hdr(struct inode *inode);
+void pnfs_get_layout_stateid(nfs4_stateid *dst, struct pnfs_layout_hdr *lo);
 
 
 static inline int lo_fail_bit(u32 iomode)
diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
index 2dde7c8..dcdd11c 100644
--- a/include/linux/nfs4.h
+++ b/include/linux/nfs4.h
@@ -545,6 +545,8 @@ enum {
 	NFSPROC4_CLNT_SEQUENCE,
 	NFSPROC4_CLNT_GET_LEASE_TIME,
 	NFSPROC4_CLNT_RECLAIM_COMPLETE,
+	NFSPROC4_CLNT_LAYOUTGET,
+	NFSPROC4_CLNT_GETDEVICEINFO,
 };
 
 /* nfs41 types */
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index e670a9c..7512886 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -83,6 +83,7 @@ struct nfs_client {
 	u32			cl_exchange_flags;
 	struct nfs4_session	*cl_session; 	/* sharred session */
 	struct list_head	cl_layouts;
+	struct nfs4_deviceid_cache *cl_devid_cache; /* pNFS deviceid cache */
 #endif /* CONFIG_NFS_V4_1 */
 
 #ifdef CONFIG_NFS_FSCACHE
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 8a2c228..c4c6a61 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -186,6 +186,55 @@ struct nfs4_get_lease_time_res {
 	struct nfs4_sequence_res	lr_seq_res;
 };
 
+#define PNFS_LAYOUT_MAXSIZE 4096
+
+struct nfs4_layoutdriver_data {
+	__u32 len;
+	void *buf;
+};
+
+struct pnfs_layout_range {
+	u32 iomode;
+	u64 offset;
+	u64 length;
+};
+
+struct nfs4_layoutget_args {
+	__u32 type;
+	struct pnfs_layout_range range;
+	__u64 minlength;
+	__u32 maxcount;
+	struct inode *inode;
+	struct nfs_open_context *ctx;
+	struct nfs4_sequence_args seq_args;
+};
+
+struct nfs4_layoutget_res {
+	__u32 return_on_close;
+	struct pnfs_layout_range range;
+	__u32 type;
+	nfs4_stateid stateid;
+	struct nfs4_layoutdriver_data layout;
+	struct nfs4_sequence_res seq_res;
+};
+
+struct nfs4_layoutget {
+	struct nfs4_layoutget_args args;
+	struct nfs4_layoutget_res res;
+	struct pnfs_layout_segment **lsegpp;
+	int status;
+};
+
+struct nfs4_getdeviceinfo_args {
+	struct pnfs_device *pdev;
+	struct nfs4_sequence_args seq_args;
+};
+
+struct nfs4_getdeviceinfo_res {
+	struct pnfs_device *pdev;
+	struct nfs4_sequence_res seq_res;
+};
+
 /*
  * Arguments to the open call.
  */
-- 
1.7.2.1
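
Reviewer aid, not part of the patch: the iomode matching table above
is_matching_lseg() and the ordering used by cmp_layout() in fs/nfs/pnfs.c
can be checked in userspace with a sketch along these lines. The enum
values are assumed to be the layoutiomode4 values from RFC 5661, and the
function names are shortened stand-ins, not the kernel symbols.

```c
#include <stdint.h>

/* Assumed to mirror layoutiomode4 in RFC 5661 */
enum { IOMODE_READ = 1, IOMODE_RW = 2, IOMODE_ANY = 3 };

/*
 * Stand-in for is_matching_lseg(): a cached segment satisfies a request
 * unless the request is RW and the segment is only READ.
 */
static int is_matching(uint32_t lseg_iomode, uint32_t req_iomode)
{
	return req_iomode != IOMODE_RW || lseg_iomode == IOMODE_RW;
}

/*
 * Same arithmetic as cmp_layout() in the patch: positive when the first
 * argument is RW and the second is READ, negative for the reverse, and
 * zero when both are READ or both are non-READ.
 */
static int cmp_layout(uint32_t iomode1, uint32_t iomode2)
{
	return (int)(iomode2 == IOMODE_READ) - (int)(iomode1 == IOMODE_READ);
}
```

With this ordering, pnfs_insert_layout() skips past RW entries when it
inserts a READ segment, so RW segments stay at the front of lo->segs and
are the first ones pnfs_has_layout() sees.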


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 13/13] RFC: pnfs: filelayout: add driver's LAYOUTGET and GETDEVICEINFO infrastructure
  2010-09-02 18:00 [PATCH 00/13] RFC: pnfs: LAYOUTGET/DEVINFO submission Fred Isaman
                   ` (11 preceding siblings ...)
  2010-09-02 18:00 ` [PATCH 12/13] RFC: pnfs: add LAYOUTGET and GETDEVICEINFO infrastructure Fred Isaman
@ 2010-09-02 18:00 ` Fred Isaman
  2010-09-10 20:33   ` Trond Myklebust
  12 siblings, 1 reply; 55+ messages in thread
From: Fred Isaman @ 2010-09-02 18:00 UTC (permalink / raw)
  To: linux-nfs

From: The pNFS Team <linux-nfs@vger.kernel.org>

Implement the driver's io_ops->alloc_lseg and free_lseg functions,
which integrate with the deviceid cache and call out to
nfs4_proc_getdeviceinfo when necessary.

Signed-off-by: TBD - melding/reorganization of several patches
---
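
Note for reviewers (not for the changelog): the "look in the deviceid
cache, otherwise fetch and add" flow that alloc_lseg relies on can be
sketched in userspace roughly as follows. This is only an illustration
under stated assumptions: it is single threaded, so the spinlock and RCU
handling of the real cache are left out, and dev_node, dev_cache,
fetch_device() and lookup_or_fetch() are hypothetical names. Only the
hash arithmetic is taken from nfs4_deviceid_hash() in pnfs.h.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEVICEID_SIZE 16		/* NFS4_PNFS_DEVICEID4_SIZE */
#define HASH_BITS 5
#define HASH_SIZE (1u << HASH_BITS)
#define HASH_MASK (HASH_SIZE - 1)

struct devid { unsigned char data[DEVICEID_SIZE]; };

struct dev_node {			/* hypothetical stand-in for nfs4_deviceid */
	struct dev_node *next;
	struct devid id;
	int ref;
};

struct dev_cache {			/* hypothetical, no locking */
	struct dev_node *bucket[HASH_SIZE];
	int fetches;			/* counts simulated GETDEVICEINFO calls */
};

/* Same arithmetic as nfs4_deviceid_hash() */
static uint32_t devid_hash(const struct devid *id)
{
	uint32_t x = 0;
	unsigned int i;

	for (i = 0; i < DEVICEID_SIZE; i++)
		x = x * 37 + id->data[i];
	return x & HASH_MASK;
}

/* Find a cached device and take a reference, or return NULL on a miss */
static struct dev_node *cache_find_get(struct dev_cache *c,
				       const struct devid *id)
{
	uint32_t h = devid_hash(id);
	struct dev_node *d;

	assert(h < HASH_SIZE);
	for (d = c->bucket[h]; d; d = d->next)
		if (!memcmp(&d->id, id, sizeof(*id))) {
			d->ref++;
			return d;
		}
	return NULL;
}

/* Hypothetical stand-in for the GETDEVICEINFO round trip */
static struct dev_node *fetch_device(struct dev_cache *c,
				     const struct devid *id)
{
	struct dev_node *d = calloc(1, sizeof(*d));

	if (!d)
		return NULL;
	d->id = *id;
	d->ref = 1;
	c->fetches++;
	return d;
}

/* Cache hit returns the existing node; a miss fetches and inserts */
static struct dev_node *lookup_or_fetch(struct dev_cache *c,
					const struct devid *id)
{
	struct dev_node *d = cache_find_get(c, id);
	uint32_t h;

	if (d)
		return d;
	d = fetch_device(c, id);
	if (!d)
		return NULL;
	h = devid_hash(id);
	d->next = c->bucket[h];
	c->bucket[h] = d;
	return d;
}
```

A second lookup of the same deviceid should return the same node with
its reference count bumped and without another fetch, which is the
behaviour filelayout_check_layout() depends on.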
 fs/nfs/Makefile            |    2 +-
 fs/nfs/client.c            |    1 +
 fs/nfs/nfs4filelayout.c    |  203 ++++++++++++++++++++-
 fs/nfs/nfs4filelayout.h    |   74 +++++++
 fs/nfs/nfs4filelayoutdev.c |  450 ++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 728 insertions(+), 2 deletions(-)
 create mode 100644 fs/nfs/nfs4filelayout.h
 create mode 100644 fs/nfs/nfs4filelayoutdev.c

diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index 08a8889..4776ff9 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -20,4 +20,4 @@ nfs-$(CONFIG_SYSCTL) += sysctl.o
 nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-index.o
 
 obj-$(CONFIG_PNFS_FILE_LAYOUT) += nfs_layout_nfsv41_files.o
-nfs_layout_nfsv41_files-y := nfs4filelayout.o
+nfs_layout_nfsv41_files-y := nfs4filelayout.o nfs4filelayoutdev.o
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 6fc5c84..bac8ac2 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -255,6 +255,7 @@ void nfs_put_client(struct nfs_client *clp)
 		nfs_free_client(clp);
 	}
 }
+EXPORT_SYMBOL(nfs_put_client);
 
 #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
 /*
diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
index c685196..0104d09 100644
--- a/fs/nfs/nfs4filelayout.c
+++ b/fs/nfs/nfs4filelayout.c
@@ -30,7 +30,9 @@
  */
 
 #include <linux/nfs_fs.h>
-#include "pnfs.h"
+
+#include "internal.h"
+#include "nfs4filelayout.h"
 
 #define NFSDBG_FACILITY         NFSDBG_PNFS_LD
 
@@ -41,18 +43,217 @@ MODULE_DESCRIPTION("The NFSv4 file layout driver");
 int
 filelayout_initialize_mountpoint(struct nfs_client *clp)
 {
+	int status = nfs4_alloc_init_deviceid_cache(clp,
+						nfs4_fl_free_deviceid_callback);
+	if (status) {
+		printk(KERN_WARNING "%s: deviceid cache could not be "
+			"initialized\n", __func__);
+		return status;
+	}
+	dprintk("%s: deviceid cache has been initialized successfully\n",
+		__func__);
 	return 0;
 }
 
+/* Uninitialize a mountpoint by destroying its device list */
 int
 filelayout_uninitialize_mountpoint(struct nfs_client *clp)
 {
 	dprintk("--> %s\n", __func__);
 
+	if (clp->cl_devid_cache)
+		nfs4_put_deviceid_cache(clp);
+	return 0;
+}
+
+/*
+ * filelayout_check_layout()
+ *
+ * Make sure layout segment parameters are sane WRT the device.
+ * At this point no generic layer initialization of the lseg has occurred,
+ * and nothing has been added to the layout_hdr cache.
+ *
+ */
+static int
+filelayout_check_layout(struct pnfs_layout_hdr *lo,
+			struct nfs4_filelayout_segment *fl,
+			struct nfs4_layoutget_res *lgr)
+{
+	struct pnfs_layout_segment *lseg = &fl->generic_hdr;
+	struct nfs4_file_layout_dsaddr *dsaddr;
+	int status = -EINVAL;
+	struct nfs_server *nfss = PNFS_NFS_SERVER(lo);
+
+	dprintk("--> %s\n", __func__);
+
+	if (fl->pattern_offset > lgr->range.offset) {
+		dprintk("%s pattern_offset %lld too large\n",
+				__func__, fl->pattern_offset);
+		goto out;
+	}
+
+	if (fl->stripe_unit % PAGE_SIZE) {
+		dprintk("%s Stripe unit (%u) not page aligned\n",
+			__func__, fl->stripe_unit);
+		goto out;
+	}
+
+	/* find and reference the deviceid */
+	dsaddr = nfs4_fl_find_get_deviceid(nfss->nfs_client, &fl->dev_id);
+	if (dsaddr == NULL) {
+		dsaddr = get_device_info(lo->inode, &fl->dev_id);
+		if (dsaddr == NULL)
+			goto out;
+	}
+
+	nfs4_set_layout_deviceid(lseg, &dsaddr->deviceid);
+
+	if (fl->first_stripe_index < 0 ||
+	    fl->first_stripe_index >= dsaddr->stripe_count) {
+		dprintk("%s Bad first_stripe_index %d\n",
+				__func__, fl->first_stripe_index);
+		goto out_put;
+	}
+
+	if ((fl->stripe_type == STRIPE_SPARSE &&
+	    fl->num_fh > 1 && fl->num_fh != dsaddr->ds_num) ||
+	    (fl->stripe_type == STRIPE_DENSE &&
+	    fl->num_fh != dsaddr->stripe_count)) {
+		dprintk("%s num_fh %u not valid for given packing\n",
+			__func__, fl->num_fh);
+		goto out_put;
+	}
+
+	if (fl->stripe_unit % nfss->rsize || fl->stripe_unit % nfss->wsize) {
+		dprintk("%s Stripe unit (%u) not aligned with rsize %u "
+			"wsize %u\n", __func__, fl->stripe_unit, nfss->rsize,
+			nfss->wsize);
+	}
+
+	status = 0;
+out:
+	dprintk("<-- %s returns %d\n", __func__, status);
+	return status;
+out_put:
+	nfs4_put_layout_deviceid(lseg);
+	goto out;
+}
+
+static void _filelayout_free_lseg(struct nfs4_filelayout_segment *fl);
+static void filelayout_free_fh_array(struct nfs4_filelayout_segment *fl);
+
+static int
+filelayout_decode_layout(struct pnfs_layout_hdr *flo,
+		      struct nfs4_filelayout_segment *fl,
+		      struct nfs4_layoutget_res *lgr)
+{
+	uint32_t *p = (uint32_t *)lgr->layout.buf;
+	uint32_t nfl_util;
+	int i;
+
+	dprintk("%s: set_layout_map Begin\n", __func__);
+
+	memcpy(&fl->dev_id, p, sizeof(fl->dev_id));
+	p += XDR_QUADLEN(NFS4_PNFS_DEVICEID4_SIZE);
+	print_deviceid(&fl->dev_id);
+
+	nfl_util = be32_to_cpup(p++);
+	if (nfl_util & NFL4_UFLG_COMMIT_THRU_MDS)
+		fl->commit_through_mds = 1;
+	if (nfl_util & NFL4_UFLG_DENSE)
+		fl->stripe_type = STRIPE_DENSE;
+	else
+		fl->stripe_type = STRIPE_SPARSE;
+	fl->stripe_unit = nfl_util & ~NFL4_UFLG_MASK;
+
+	fl->first_stripe_index = be32_to_cpup(p++);
+	p = xdr_decode_hyper(p, &fl->pattern_offset);
+	fl->num_fh = be32_to_cpup(p++);
+
+	dprintk("%s: nfl_util 0x%X num_fh %u fsi %u po %llu\n",
+		__func__, nfl_util, fl->num_fh, fl->first_stripe_index,
+		fl->pattern_offset);
+
+	if (fl->num_fh * sizeof(struct nfs_fh) > 2*PAGE_SIZE) {
+		fl->fh_array = vmalloc(fl->num_fh * sizeof(struct nfs_fh));
+		if (fl->fh_array)
+			memset(fl->fh_array, 0,
+				fl->num_fh * sizeof(struct nfs_fh));
+	} else {
+		fl->fh_array = kzalloc(fl->num_fh * sizeof(struct nfs_fh),
+					GFP_KERNEL);
+	}
+	if (!fl->fh_array)
+		return -ENOMEM;
+
+	for (i = 0; i < fl->num_fh; i++) {
+		/* fh */
+		fl->fh_array[i].size = be32_to_cpup(p++);
+		if (sizeof(struct nfs_fh) < fl->fh_array[i].size) {
+			printk(KERN_ERR "Too big fh %d received %d\n",
+				i, fl->fh_array[i].size);
+			/* Layout is now invalid, pretend it doesn't exist */
+			filelayout_free_fh_array(fl);
+			fl->num_fh = 0;
+			break;
+		}
+		memcpy(fl->fh_array[i].data, p, fl->fh_array[i].size);
+		p += XDR_QUADLEN(fl->fh_array[i].size);
+		dprintk("DEBUG: %s: fh len %d\n", __func__,
+					fl->fh_array[i].size);
+	}
+
 	return 0;
 }
 
+static struct pnfs_layout_segment *
+filelayout_alloc_lseg(struct pnfs_layout_hdr *layoutid,
+		      struct nfs4_layoutget_res *lgr)
+{
+	struct nfs4_filelayout_segment *fl;
+	int rc;
+
+	dprintk("--> %s\n", __func__);
+	fl = kzalloc(sizeof(*fl), GFP_KERNEL);
+	if (!fl)
+		return NULL;
+
+	rc = filelayout_decode_layout(layoutid, fl, lgr);
+	if (rc != 0 || filelayout_check_layout(layoutid, fl, lgr)) {
+		_filelayout_free_lseg(fl);
+		return NULL;
+	}
+	return &fl->generic_hdr;
+}
+
+static void filelayout_free_fh_array(struct nfs4_filelayout_segment *fl)
+{
+	if (fl->num_fh * sizeof(struct nfs_fh) > 2*PAGE_SIZE)
+		vfree(fl->fh_array);
+	else
+		kfree(fl->fh_array);
+
+	fl->fh_array = NULL;
+}
+
+static void
+_filelayout_free_lseg(struct nfs4_filelayout_segment *fl)
+{
+	filelayout_free_fh_array(fl);
+	kfree(fl);
+}
+
+static void
+filelayout_free_lseg(struct pnfs_layout_segment *lseg)
+{
+	dprintk("--> %s\n", __func__);
+	nfs4_put_layout_deviceid(lseg);
+	_filelayout_free_lseg(FILELAYOUT_LSEG(lseg));
+}
+
 struct layoutdriver_io_operations filelayout_io_operations = {
+	.alloc_lseg              = filelayout_alloc_lseg,
+	.free_lseg               = filelayout_free_lseg,
 	.initialize_mountpoint   = filelayout_initialize_mountpoint,
 	.uninitialize_mountpoint = filelayout_uninitialize_mountpoint,
 };
diff --git a/fs/nfs/nfs4filelayout.h b/fs/nfs/nfs4filelayout.h
new file mode 100644
index 0000000..2467b5f
--- /dev/null
+++ b/fs/nfs/nfs4filelayout.h
@@ -0,0 +1,74 @@
+/*
+ *  NFSv4 file layout driver data structures.
+ *
+ *  Copyright (c) 2002 The Regents of the University of Michigan.
+ *  All rights reserved.
+ *
+ *  Dean Hildebrand   <dhildebz@umich.edu>
+ */
+
+#ifndef FS_NFS_NFS4FILELAYOUT_H
+#define FS_NFS_NFS4FILELAYOUT_H
+
+#include "pnfs.h"
+
+/*
+ * Field testing shows we need to support up to 4096 stripe indices.
+ * We store each index as a u8 (u32 on the wire) to keep the memory footprint
+ * reasonable. This in turn means we support a maximum of 256
+ * RFC 5661 multipath_list4 structures.
+ */
+#define NFS4_PNFS_MAX_STRIPE_CNT 4096
+#define NFS4_PNFS_MAX_MULTI_CNT  256 /* 256 fit into a u8 stripe_index */
+
+enum stripetype4 {
+	STRIPE_SPARSE = 1,
+	STRIPE_DENSE = 2
+};
+
+/* Individual ip address */
+struct nfs4_pnfs_ds {
+	struct list_head	ds_node;  /* nfs4_pnfs_dev_hlist dev_dslist */
+	u32			ds_ip_addr;
+	u32			ds_port;
+	struct nfs_client	*ds_clp;
+	atomic_t		ds_count;
+};
+
+struct nfs4_file_layout_dsaddr {
+	struct nfs4_deviceid	deviceid;
+	u32			stripe_count;
+	u8			*stripe_indices;
+	u32			ds_num;
+	struct nfs4_pnfs_ds	*ds_list[1];
+};
+
+struct nfs4_filelayout_segment {
+	struct pnfs_layout_segment generic_hdr;
+	u32 stripe_type;
+	u32 commit_through_mds;
+	u32 stripe_unit;
+	u32 first_stripe_index;
+	u64 pattern_offset;
+	struct pnfs_deviceid dev_id;
+	unsigned int num_fh;
+	struct nfs_fh *fh_array;
+};
+
+static inline struct nfs4_filelayout_segment *
+FILELAYOUT_LSEG(struct pnfs_layout_segment *lseg)
+{
+	return container_of(lseg,
+			    struct nfs4_filelayout_segment,
+			    generic_hdr);
+}
+
+extern void nfs4_fl_free_deviceid_callback(struct nfs4_deviceid *);
+extern void print_ds(struct nfs4_pnfs_ds *ds);
+extern void print_deviceid(struct pnfs_deviceid *dev_id);
+extern struct nfs4_file_layout_dsaddr *
+nfs4_fl_find_get_deviceid(struct nfs_client *, struct pnfs_deviceid *dev_id);
+struct nfs4_file_layout_dsaddr *
+get_device_info(struct inode *inode, struct pnfs_deviceid *dev_id);
+
+#endif /* FS_NFS_NFS4FILELAYOUT_H */
diff --git a/fs/nfs/nfs4filelayoutdev.c b/fs/nfs/nfs4filelayoutdev.c
new file mode 100644
index 0000000..833ff9a
--- /dev/null
+++ b/fs/nfs/nfs4filelayoutdev.c
@@ -0,0 +1,450 @@
+/*
+ *  Device operations for the pnfs nfs4 file layout driver.
+ *
+ *  Copyright (c) 2002
+ *  The Regents of the University of Michigan
+ *  All Rights Reserved
+ *
+ *  Dean Hildebrand <dhildebz@umich.edu>
+ *  Garth Goodson   <Garth.Goodson@netapp.com>
+ *
+ *  Permission is granted to use, copy, create derivative works, and
+ *  redistribute this software and such derivative works for any purpose,
+ *  so long as the name of the University of Michigan is not used in
+ *  any advertising or publicity pertaining to the use or distribution
+ *  of this software without specific, written prior authorization. If
+ *  the above copyright notice or any other identification of the
+ *  University of Michigan is included in any copy of any portion of
+ *  this software, then the disclaimer below must also be included.
+ *
+ *  This software is provided as is, without representation or warranty
+ *  of any kind either express or implied, including without limitation
+ *  the implied warranties of merchantability, fitness for a particular
+ *  purpose, or noninfringement.  The Regents of the University of
+ *  Michigan shall not be liable for any damages, including special,
+ *  indirect, incidental, or consequential damages, with respect to any
+ *  claim arising out of or in connection with the use of the software,
+ *  even if it has been or is hereafter advised of the possibility of
+ *  such damages.
+ */
+
+#include <linux/nfs_fs.h>
+
+#include "internal.h"
+#include "nfs4filelayout.h"
+
+#define NFSDBG_FACILITY		NFSDBG_PNFS_LD
+
+/*
+ * Data server cache
+ *
+ * Data servers can be mapped to different device ids.
+ * nfs4_pnfs_ds reference counting
+ *   - set to 1 on allocation
+ *   - incremented when a device id maps a data server already in the cache.
+ *   - decremented when deviceid is removed from the cache.
+ */
+DEFINE_SPINLOCK(nfs4_ds_cache_lock);
+static LIST_HEAD(nfs4_data_server_cache);
+
+/* Debug routines */
+void
+print_ds(struct nfs4_pnfs_ds *ds)
+{
+	if (ds == NULL) {
+		dprintk("%s NULL device\n", __func__);
+		return;
+	}
+	dprintk("        ip_addr %x port %hu\n"
+		"        ref count %d\n"
+		"        client %p\n"
+		"        cl_exchange_flags %x\n",
+		ntohl(ds->ds_ip_addr), ntohs(ds->ds_port),
+		atomic_read(&ds->ds_count), ds->ds_clp,
+		ds->ds_clp ? ds->ds_clp->cl_exchange_flags : 0);
+}
+
+void
+print_ds_list(struct nfs4_file_layout_dsaddr *dsaddr)
+{
+	int i;
+
+	dprintk("%s dsaddr->ds_num %d\n", __func__,
+		dsaddr->ds_num);
+	for (i = 0; i < dsaddr->ds_num; i++)
+		print_ds(dsaddr->ds_list[i]);
+}
+
+void print_deviceid(struct pnfs_deviceid *id)
+{
+	u32 *p = (u32 *)id;
+
+	dprintk("%s: device id= [%x%x%x%x]\n", __func__,
+		p[0], p[1], p[2], p[3]);
+}
+
+/* nfs4_ds_cache_lock is held */
+static struct nfs4_pnfs_ds *
+_data_server_lookup_locked(u32 ip_addr, u32 port)
+{
+	struct nfs4_pnfs_ds *ds;
+
+	dprintk("_data_server_lookup: ip_addr=%x port=%hu\n",
+			ntohl(ip_addr), ntohs(port));
+
+	list_for_each_entry(ds, &nfs4_data_server_cache, ds_node) {
+		if (ds->ds_ip_addr == ip_addr &&
+		    ds->ds_port == port) {
+			return ds;
+		}
+	}
+	return NULL;
+}
+
+static void
+destroy_ds(struct nfs4_pnfs_ds *ds)
+{
+	dprintk("--> %s\n", __func__);
+	print_ds(ds);
+
+	if (ds->ds_clp)
+		nfs_put_client(ds->ds_clp);
+	kfree(ds);
+}
+
+static void
+nfs4_fl_free_deviceid(struct nfs4_file_layout_dsaddr *dsaddr)
+{
+	struct nfs4_pnfs_ds *ds;
+	int i;
+
+	print_deviceid(&dsaddr->deviceid.de_id);
+
+	for (i = 0; i < dsaddr->ds_num; i++) {
+		ds = dsaddr->ds_list[i];
+		if (ds != NULL) {
+			if (atomic_dec_and_lock(&ds->ds_count,
+						&nfs4_ds_cache_lock)) {
+				list_del_init(&ds->ds_node);
+				spin_unlock(&nfs4_ds_cache_lock);
+				destroy_ds(ds);
+			}
+		}
+	}
+	kfree(dsaddr->stripe_indices);
+	kfree(dsaddr);
+}
+
+void
+nfs4_fl_free_deviceid_callback(struct nfs4_deviceid *device)
+{
+	struct nfs4_file_layout_dsaddr *dsaddr =
+		container_of(device, struct nfs4_file_layout_dsaddr, deviceid);
+
+	nfs4_fl_free_deviceid(dsaddr);
+}
+
+static struct nfs4_pnfs_ds *
+nfs4_pnfs_ds_add(struct inode *inode, u32 ip_addr, u32 port)
+{
+	struct nfs4_pnfs_ds *tmp_ds, *ds;
+
+	ds = kzalloc(sizeof(*ds), GFP_KERNEL);
+	if (!ds)
+		goto out;
+
+	spin_lock(&nfs4_ds_cache_lock);
+	tmp_ds = _data_server_lookup_locked(ip_addr, port);
+	if (tmp_ds == NULL) {
+		ds->ds_ip_addr = ip_addr;
+		ds->ds_port = port;
+		atomic_set(&ds->ds_count, 1);
+		INIT_LIST_HEAD(&ds->ds_node);
+		ds->ds_clp = NULL;
+		list_add(&ds->ds_node, &nfs4_data_server_cache);
+		dprintk("%s add new data server ip 0x%x\n", __func__,
+			ds->ds_ip_addr);
+	} else {
+		kfree(ds);
+		atomic_inc(&tmp_ds->ds_count);
+		dprintk("%s data server found ip 0x%x, inc'ed ds_count to %d\n",
+			__func__, tmp_ds->ds_ip_addr,
+			atomic_read(&tmp_ds->ds_count));
+		ds = tmp_ds;
+	}
+	spin_unlock(&nfs4_ds_cache_lock);
+out:
+	return ds;
+}
+
+/*
+ * Currently only support ipv4, and one multi-path address.
+ */
+static struct nfs4_pnfs_ds *
+decode_and_add_ds(__be32 **pp, struct inode *inode)
+{
+	struct nfs4_pnfs_ds *ds = NULL;
+	char *buf;
+	const char *ipend, *pstr;
+	u32 ip_addr, port;
+	int nlen, rlen, i;
+	int tmp[2];
+	__be32 *r_netid, *r_addr, *p = *pp;
+
+	/* r_netid */
+	nlen = be32_to_cpup(p++);
+	r_netid = p;
+	p += XDR_QUADLEN(nlen);
+
+	/* r_addr */
+	rlen = be32_to_cpup(p++);
+	r_addr = p;
+	p += XDR_QUADLEN(rlen);
+	*pp = p;
+
+	/* Check that netid is "tcp" */
+	if (nlen != 3 ||  memcmp((char *)r_netid, "tcp", 3)) {
+		dprintk("%s: ERROR: non ipv4 TCP r_netid\n", __func__);
+		goto out_err;
+	}
+
+	/* ipv6 length plus port is legal */
+	if (rlen > INET6_ADDRSTRLEN + 8) {
+		dprintk("%s Invalid address, length %d\n", __func__,
+			rlen);
+		goto out_err;
+	}
+	buf = kstrndup((const char *)r_addr, rlen, GFP_KERNEL);
+	if (buf == NULL)
+		goto out_err;
+
+	/* replace the port dots with dashes for the in4_pton() delimiter */
+	for (i = 0; i < 2; i++) {
+		char *res = strrchr(buf, '.');
+		*res = '-';
+	}
+
+	/* Currently only support ipv4 address */
+	if (in4_pton(buf, rlen, (u8 *)&ip_addr, '-', &ipend) == 0) {
+		dprintk("%s: Only ipv4 addresses supported\n", __func__);
+		goto out_free;
+	}
+
+	/* port */
+	pstr = ipend;
+	sscanf(pstr, "-%d-%d", &tmp[0], &tmp[1]);
+	port = htons((tmp[0] << 8) | (tmp[1]));
+
+	ds = nfs4_pnfs_ds_add(inode, ip_addr, port);
+	dprintk("%s Decoded address and port %s\n", __func__, buf);
+out_free:
+	kfree(buf);
+out_err:
+	return ds;
+}
+
+
+
+/* Decode opaque device data and return the result */
+static struct nfs4_file_layout_dsaddr*
+decode_device(struct inode *ino, struct pnfs_device *pdev)
+{
+	int i, dummy;
+	u32 cnt, num;
+	u8 *indexp;
+	__be32 *p = (__be32 *)pdev->area, *indicesp;
+	struct nfs4_file_layout_dsaddr *dsaddr;
+
+	/* Get the stripe count (number of stripe indices) */
+	cnt = be32_to_cpup(p++);
+	dprintk("%s stripe count  %d\n", __func__, cnt);
+	if (cnt > NFS4_PNFS_MAX_STRIPE_CNT) {
+		printk(KERN_WARNING "%s: stripe count %d greater than "
+		       "supported maximum %d\n", __func__,
+			cnt, NFS4_PNFS_MAX_STRIPE_CNT);
+		goto out_err;
+	}
+
+	/* Check the multipath list count */
+	indicesp = p;
+	p += XDR_QUADLEN(cnt << 2);
+	num = be32_to_cpup(p++);
+	dprintk("%s ds_num %u\n", __func__, num);
+	if (num > NFS4_PNFS_MAX_MULTI_CNT) {
+		printk(KERN_WARNING "%s: multipath count %d greater than "
+			"supported maximum %d\n", __func__,
+			num, NFS4_PNFS_MAX_MULTI_CNT);
+		goto out_err;
+	}
+	dsaddr = kzalloc(sizeof(*dsaddr) +
+			(sizeof(struct nfs4_pnfs_ds *) * (num - 1)),
+			GFP_KERNEL);
+	if (!dsaddr)
+		goto out_err;
+
+	dsaddr->stripe_indices = kzalloc(sizeof(u8) * cnt, GFP_KERNEL);
+	if (!dsaddr->stripe_indices)
+		goto out_err_free;
+
+	dsaddr->stripe_count = cnt;
+	dsaddr->ds_num = num;
+
+	memcpy(&dsaddr->deviceid.de_id, &pdev->dev_id, sizeof(pdev->dev_id));
+
+	/* Go back and read stripe indices */
+	p = indicesp;
+	indexp = &dsaddr->stripe_indices[0];
+	for (i = 0; i < dsaddr->stripe_count; i++) {
+		*indexp = be32_to_cpup(p++);
+		if (*indexp >= num)
+			goto out_err_free;
+		indexp++;
+	}
+	/* Skip already read multipath list count */
+	p++;
+
+	for (i = 0; i < dsaddr->ds_num; i++) {
+		int j;
+
+		dummy = be32_to_cpup(p++); /* multipath count */
+		if (dummy > 1) {
+			printk(KERN_WARNING
+			       "%s: Multipath count %d not supported, "
+			       "skipping all greater than 1\n", __func__,
+				dummy);
+		}
+		for (j = 0; j < dummy; j++) {
+			if (j == 0) {
+				dsaddr->ds_list[i] = decode_and_add_ds(&p, ino);
+				if (dsaddr->ds_list[i] == NULL)
+					goto out_err_free;
+			} else {
+				u32 len;
+				/* skip extra multipath */
+				len = be32_to_cpup(p++);
+				p += XDR_QUADLEN(len);
+				len = be32_to_cpup(p++);
+				p += XDR_QUADLEN(len);
+				continue;
+			}
+		}
+	}
+	nfs4_init_deviceid_node(&dsaddr->deviceid);
+
+	return dsaddr;
+
+out_err_free:
+	nfs4_fl_free_deviceid(dsaddr);
+out_err:
+	dprintk("%s ERROR: returning NULL\n", __func__);
+	return NULL;
+}
+
+/*
+ * Decode the opaque device specified in 'dev'
+ * and add it to the list of available devices.
+ * If the deviceid is already cached, nfs4_add_deviceid will return
+ * a pointer to the cached struct and throw away the new.
+ */
+static struct nfs4_file_layout_dsaddr*
+decode_and_add_device(struct inode *inode, struct pnfs_device *dev)
+{
+	struct nfs4_file_layout_dsaddr *dsaddr;
+	struct nfs4_deviceid *d;
+
+	dsaddr = decode_device(inode, dev);
+	if (!dsaddr) {
+		printk(KERN_WARNING "%s: Could not decode or add device\n",
+			__func__);
+		return NULL;
+	}
+
+	d = nfs4_add_deviceid(NFS_SERVER(inode)->nfs_client->cl_devid_cache,
+			      &dsaddr->deviceid);
+
+	return container_of(d, struct nfs4_file_layout_dsaddr, deviceid);
+}
+
+/*
+ * Retrieve the information for dev_id, add it to the list
+ * of available devices, and return it.
+ */
+struct nfs4_file_layout_dsaddr *
+get_device_info(struct inode *inode, struct pnfs_deviceid *dev_id)
+{
+	struct pnfs_device *pdev = NULL;
+	u32 max_resp_sz;
+	int max_pages;
+	struct page **pages = NULL;
+	struct nfs4_file_layout_dsaddr *dsaddr = NULL;
+	int rc, i;
+	struct nfs_server *server = NFS_SERVER(inode);
+
+	/*
+	 * Use the session max response size as the basis for setting
+	 * GETDEVICEINFO's maxcount
+	 */
+	max_resp_sz = server->nfs_client->cl_session->fc_attrs.max_resp_sz;
+	max_pages = max_resp_sz >> PAGE_SHIFT;
+	dprintk("%s inode %p max_resp_sz %u max_pages %d\n",
+		__func__, inode, max_resp_sz, max_pages);
+
+	pdev = kzalloc(sizeof(struct pnfs_device), GFP_KERNEL);
+	if (pdev == NULL)
+		return NULL;
+
+	pages = kzalloc(max_pages * sizeof(struct page *), GFP_KERNEL);
+	if (pages == NULL) {
+		kfree(pdev);
+		return NULL;
+	}
+	for (i = 0; i < max_pages; i++) {
+		pages[i] = alloc_page(GFP_KERNEL);
+		if (!pages[i])
+			goto out_free;
+	}
+
+	/* set pdev->area */
+	pdev->area = vmap(pages, max_pages, VM_MAP, PAGE_KERNEL);
+	if (!pdev->area)
+		goto out_free;
+
+	memcpy(&pdev->dev_id, dev_id, sizeof(*dev_id));
+	pdev->layout_type = LAYOUT_NFSV4_1_FILES;
+	pdev->pages = pages;
+	pdev->pgbase = 0;
+	pdev->pglen = PAGE_SIZE * max_pages;
+	pdev->mincount = 0;
+	/* TODO: Update types when CB_NOTIFY_DEVICEID is available */
+	pdev->dev_notify_types = 0;
+
+	rc = nfs4_proc_getdeviceinfo(server, pdev);
+	dprintk("%s getdevice info returns %d\n", __func__, rc);
+	if (rc)
+		goto out_free;
+
+	/*
+	 * Found new device, need to decode it and then add it to the
+	 * list of known devices for this mountpoint.
+	 */
+	dsaddr = decode_and_add_device(inode, pdev);
+out_free:
+	if (pdev->area != NULL)
+		vunmap(pdev->area);
+	for (i = 0; i < max_pages && pages[i] != NULL; i++)
+		__free_page(pages[i]);
+	kfree(pages);
+	kfree(pdev);
+	dprintk("<-- %s dsaddr %p\n", __func__, dsaddr);
+	return dsaddr;
+}
+
+struct nfs4_file_layout_dsaddr *
+nfs4_fl_find_get_deviceid(struct nfs_client *clp, struct pnfs_deviceid *id)
+{
+	struct nfs4_deviceid *d;
+
+	d = nfs4_find_get_deviceid(clp->cl_devid_cache, id);
+	return (d == NULL) ? NULL :
+		container_of(d, struct nfs4_file_layout_dsaddr, deviceid);
+}
-- 
1.7.2.1
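
The r_addr string handled by decode_and_add_ds() in the patch above is a universal address of the form h1.h2.h3.h4.p1.p2 (RFC 5665), where the port is p1 * 256 + p2. The following is a minimal userspace sketch of a bounds-checked parse, assuming IPv4 only. The helper name parse_uaddr4 is hypothetical and not part of the patch; it returns host-order values, whereas the patch stores network-order ones.

```c
#include <stdio.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): parse "h1.h2.h3.h4.p1.p2"
 * into a host-order IPv4 address and port.
 * Returns 0 on success, -1 on malformed input.
 */
int parse_uaddr4(const char *s, uint32_t *addr, uint16_t *port)
{
	unsigned int v[6];
	char extra;
	int i;

	/* Require exactly six decimal fields with nothing after them. */
	if (sscanf(s, "%3u.%3u.%3u.%3u.%3u.%3u%c",
		   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &extra) != 6)
		return -1;

	/* Every field is a single octet. */
	for (i = 0; i < 6; i++)
		if (v[i] > 255)
			return -1;

	*addr = (v[0] << 24) | (v[1] << 16) | (v[2] << 8) | v[3];
	*port = (uint16_t)((v[4] << 8) | v[5]);
	return 0;
}
```

For example, "10.0.0.1.8.1" yields address 0x0A000001 and port 2049. Checking the field count up front avoids acting on a string that has fewer than two port octets.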


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/13] RFC: pnfs: full mount/umount infrastructure
  2010-09-02 18:00 ` [PATCH 07/13] RFC: pnfs: full mount/umount infrastructure Fred Isaman
@ 2010-09-10 19:23   ` Trond Myklebust
       [not found]     ` <1284146604.10062.68.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
                       ` (2 more replies)
  2010-09-10 23:58   ` Christoph Hellwig
  2010-09-13 15:07   ` Christoph Hellwig
  2 siblings, 3 replies; 55+ messages in thread
From: Trond Myklebust @ 2010-09-10 19:23 UTC (permalink / raw)
  To: Fred Isaman; +Cc: linux-nfs

On Thu, 2010-09-02 at 14:00 -0400, Fred Isaman wrote:
> From: The pNFS Team <linux-nfs@vger.kernel.org>
> 
> Allow a module implementing a layout type to register, and
> have its mount/umount routines called for filesystems that
> the server declares support it.
> 
> Signed-off-by: TBD - melding/reorganization of several patches
> ---
>  Documentation/filesystems/nfs/00-INDEX |    2 +
>  Documentation/filesystems/nfs/pnfs.txt |   48 +++++++++++++++++++
>  fs/nfs/Kconfig                         |    2 +-
>  fs/nfs/pnfs.c                          |   79 +++++++++++++++++++++++++++++++-
>  fs/nfs/pnfs.h                          |   14 ++++++
>  5 files changed, 142 insertions(+), 3 deletions(-)
>  create mode 100644 Documentation/filesystems/nfs/pnfs.txt
> 
> diff --git a/Documentation/filesystems/nfs/00-INDEX b/Documentation/filesystems/nfs/00-INDEX
> index 2f68cd6..8d930b9 100644
> --- a/Documentation/filesystems/nfs/00-INDEX
> +++ b/Documentation/filesystems/nfs/00-INDEX
> @@ -12,5 +12,7 @@ nfs-rdma.txt
>  	- how to install and setup the Linux NFS/RDMA client and server software
>  nfsroot.txt
>  	- short guide on setting up a diskless box with NFS root filesystem.
> +pnfs.txt
> +	- short explanation of some of the internals of the pnfs code
>  rpc-cache.txt
>  	- introduction to the caching mechanisms in the sunrpc layer.
> diff --git a/Documentation/filesystems/nfs/pnfs.txt b/Documentation/filesystems/nfs/pnfs.txt
> new file mode 100644
> index 0000000..bc0b9cf
> --- /dev/null
> +++ b/Documentation/filesystems/nfs/pnfs.txt
> @@ -0,0 +1,48 @@
> +Reference counting in pnfs:
> +==========================
> +
> +There are several inter-related caches.  We have layouts which can
> +reference multiple devices, each of which can reference multiple data servers.
> +Each data server can be referenced by multiple devices.  Each device
> +can be referenced by multiple layouts.  To keep all of this straight,
> +we need to reference count.
> +
> +
> +struct pnfs_layout_hdr
> +----------------------
> +The on-the-wire command LAYOUTGET corresponds to struct
> +pnfs_layout_segment, usually referred to by the variable name lseg.
> +Each nfs_inode may hold a pointer to a cache of these layout
> +segments in nfsi->layout, of type struct pnfs_layout_hdr.
> +
> +We reference the header for the inode pointing to it, across each
> +outstanding RPC call that references it (LAYOUTGET, LAYOUTRETURN,
> +LAYOUTCOMMIT), and for each lseg held within.
> +
> +Each header is also (when non-empty) put on a list associated with
> +struct nfs_client (cl_layouts).  Being put on this list does not bump
> +the reference count, as the layout is kept around by the lseg that
> +keeps it in the list.
> +
> +deviceid_cache
> +--------------
> +lsegs reference device ids, which are resolved per nfs_client and
> +layout driver type.  The device ids are held in an RCU cache (struct
> +nfs4_deviceid_cache).  The cache itself is referenced across each
> +mount.  The entries (struct nfs4_deviceid) themselves are held across
> +the lifetime of each lseg referencing them.
> +
> +RCU is used because the deviceid is basically a write once, read many
> +data structure.  The hlist size of 32 buckets needs better
> +justification, but seems reasonable given that we can have multiple
> +deviceids per filesystem, and multiple filesystems per nfs_client.
> +
> +The hash code is copied from the nfsd code base.  A discussion of
> +hashing and variations of this algorithm can be found at:
> +http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809
> +
> +data server cache
> +-----------------
> +file driver devices refer to data servers, which are kept in a module
> +level cache.  Its reference is held over the lifetime of the deviceid
> +pointing to it.
> diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig
> index 6c2aad4..5f1b936 100644
> --- a/fs/nfs/Kconfig
> +++ b/fs/nfs/Kconfig
> @@ -78,7 +78,7 @@ config NFS_V4_1
>  	depends on NFS_V4 && EXPERIMENTAL
>  	help
>  	  This option enables support for minor version 1 of the NFSv4 protocol
> -	  (draft-ietf-nfsv4-minorversion1) in the kernel's NFS client.
> +	  (RFC 5661) in the kernel's NFS client.
>  
>  	  If unsure, say N.
>  
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index 2e5dba1..8d503fc 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -32,16 +32,48 @@
>  
>  #define NFSDBG_FACILITY		NFSDBG_PNFS
>  
> -/* STUB that returns the equivalent of "no module found" */
> +/* Locking:
> + *
> + * pnfs_spinlock:
> + *      protects pnfs_modules_tbl.
> + */
> +static DEFINE_SPINLOCK(pnfs_spinlock);
> +
> +/*
> + * pnfs_modules_tbl holds all pnfs modules
> + */
> +static LIST_HEAD(pnfs_modules_tbl);
> +
> +/* Return the registered pnfs layout driver module matching given id */
> +static struct pnfs_layoutdriver_type *
> +find_pnfs_driver_locked(u32 id) {
> +	struct  pnfs_layoutdriver_type *local;
> +
> +	dprintk("PNFS: %s: Searching for %u\n", __func__, id);
> +	list_for_each_entry(local, &pnfs_modules_tbl, pnfs_tblid)
> +		if (local->id == id)
> +			goto out;
> +	local = NULL;
> +out:
> +	return local;
> +}
> +
>  static struct pnfs_layoutdriver_type *
>  find_pnfs_driver(u32 id) {
> -	return NULL;
> +	struct  pnfs_layoutdriver_type *local;
> +
> +	spin_lock(&pnfs_spinlock);
> +	local = find_pnfs_driver_locked(id);

Don't you want some kind of reference count on this? I'd assume that you
probably need a module_get() with a corresponding module_put() when you
are done using the layoutdriver.
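
A minimal userspace sketch of the pattern in question: pin the driver while the lookup is still protected, and unpin it when the mount stops using it. In the kernel this would presumably be try_module_get() on an owner field plus module_put(); here a plain counter stands in for the module use count, and every name (struct ld_type, ld_register, ld_find_get, ld_put) is hypothetical.

```c
#include <stddef.h>

/* Stand-in for struct pnfs_layoutdriver_type; all names are hypothetical. */
struct ld_type {
	unsigned int id;
	int refcount;		/* models the owning module's use count */
	int unloading;		/* set once the module has started to unload */
	struct ld_type *next;
};

struct ld_type *ld_table;	/* models pnfs_modules_tbl */

void ld_register(struct ld_type *ld)
{
	ld->next = ld_table;
	ld_table = ld;
}

/*
 * Look up a driver and pin it. In the kernel both steps would happen
 * under pnfs_spinlock, so the driver cannot be unregistered in between.
 */
struct ld_type *ld_find_get(unsigned int id)
{
	struct ld_type *ld;

	for (ld = ld_table; ld != NULL; ld = ld->next) {
		if (ld->id != id)
			continue;
		if (ld->unloading)	/* models try_module_get() failing */
			return NULL;
		ld->refcount++;
		return ld;
	}
	return NULL;
}

/* Drop the reference taken by ld_find_get(). */
void ld_put(struct ld_type *ld)
{
	if (ld != NULL)
		ld->refcount--;
}
```

With this shape, set_pnfs_layoutdriver() would hold the reference for as long as server->pnfs_curr_ld points at the driver, and unset_pnfs_layoutdriver() would be the matching put.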

> +	spin_unlock(&pnfs_spinlock);
> +	return local;
>  }
>  
>  /* Unitialize a mountpoint in a layout driver */
>  void
>  unset_pnfs_layoutdriver(struct nfs_server *nfss)
>  {
> +	if (nfss->pnfs_curr_ld)
> +		nfss->pnfs_curr_ld->ld_io_ops->uninitialize_mountpoint(nfss->nfs_client);

That 'uninitialize_mountpoint' name doesn't make any sense. The
nfs_client parameter isn't associated to a particular mountpoint.

>  	nfss->pnfs_curr_ld = NULL;
>  }
>  
> @@ -68,6 +100,12 @@ set_pnfs_layoutdriver(struct nfs_server *server, u32 id)
>  			goto out_no_driver;
>  		}
>  	}
> +	if (ld_type->ld_io_ops->initialize_mountpoint(server->nfs_client)) {

Ditto.

> +		printk(KERN_ERR
> +		       "%s: Error initializing mount point for layout driver %u.\n",
> +		       __func__, id);
> +		goto out_no_driver;
> +	}
>  	server->pnfs_curr_ld = ld_type;
>  	dprintk("%s: pNFS module for %u set\n", __func__, id);
>  	return;
> @@ -76,3 +114,40 @@ out_no_driver:
>  	dprintk("%s: Using NFSv4 I/O\n", __func__);
>  	server->pnfs_curr_ld = NULL;
>  }
> +
> +int
> +pnfs_register_layoutdriver(struct pnfs_layoutdriver_type *ld_type)
> +{
> +	struct layoutdriver_io_operations *io_ops = ld_type->ld_io_ops;
> +	int status = -EINVAL;
> +
> +	if (!io_ops) {
> +		printk(KERN_ERR "%s Layout driver must provide io_ops\n",
> +			__func__);
> +		return status;
> +	}
> +
> +	spin_lock(&pnfs_spinlock);
> +	if (!find_pnfs_driver_locked(ld_type->id)) {
> +		list_add(&ld_type->pnfs_tblid, &pnfs_modules_tbl);
> +		status = 0;
> +		dprintk("%s Registering id:%u name:%s\n", __func__, ld_type->id,
> +			ld_type->name);
> +	} else
> +		printk(KERN_ERR "%s Module with id %d already loaded!\n",
> +			__func__, ld_type->id);
> +	spin_unlock(&pnfs_spinlock);
> +
> +	return status;
> +}
> +EXPORT_SYMBOL(pnfs_register_layoutdriver);
> +
> +void
> +pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *ld_type)
> +{
> +	dprintk("%s Deregistering id:%u\n", __func__, ld_type->id);
> +	spin_lock(&pnfs_spinlock);
> +	list_del(&ld_type->pnfs_tblid);
> +	spin_unlock(&pnfs_spinlock);
> +}
> +EXPORT_SYMBOL(pnfs_unregister_layoutdriver);
> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> index 3281fbf..9049b9a 100644
> --- a/fs/nfs/pnfs.h
> +++ b/fs/nfs/pnfs.h
> @@ -16,8 +16,22 @@
>  
>  /* Per-layout driver specific registration structure */
>  struct pnfs_layoutdriver_type {
> +	struct list_head pnfs_tblid;
> +	const u32 id;
> +	const char *name;
> +	struct layoutdriver_io_operations *ld_io_ops;
>  };
>  
> +/* Layout driver I/O operations. */
> +struct layoutdriver_io_operations {
> +	/* Registration information for a new mounted file system */
> +	int (*initialize_mountpoint) (struct nfs_client *);
> +	int (*uninitialize_mountpoint) (struct nfs_client *);
> +};
> +
> +extern int pnfs_register_layoutdriver(struct pnfs_layoutdriver_type *);
> +extern void pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *);
> +
>  void set_pnfs_layoutdriver(struct nfs_server *, u32 id);
>  void unset_pnfs_layoutdriver(struct nfs_server *);
>  



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/13] RFC: pnfs: filelayout: introduce minimal file layout driver
  2010-09-02 18:00 ` [PATCH 08/13] RFC: pnfs: filelayout: introduce minimal file layout driver Fred Isaman
@ 2010-09-10 19:31   ` Trond Myklebust
  2010-09-10 21:11     ` Fred Isaman
  2010-09-10 23:56     ` Christoph Hellwig
  2010-09-13 15:08   ` Christoph Hellwig
  1 sibling, 2 replies; 55+ messages in thread
From: Trond Myklebust @ 2010-09-10 19:31 UTC (permalink / raw)
  To: Fred Isaman; +Cc: linux-nfs

On Thu, 2010-09-02 at 14:00 -0400, Fred Isaman wrote:
> From: The pNFS Team <linux-nfs@vger.kernel.org>
> 
> This driver just registers itself and supplies trivial mount/umount functions.
> 
> Signed-off-by: TBD - melding/reorganization of several patches
> ---
>  fs/nfs/Kconfig          |    5 +++
>  fs/nfs/Makefile         |    3 ++
>  fs/nfs/nfs4filelayout.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/nfs_fs.h  |    1 +
>  4 files changed, 98 insertions(+), 0 deletions(-)
>  create mode 100644 fs/nfs/nfs4filelayout.c
> 
> diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig
> index 5f1b936..980f2dc 100644
> --- a/fs/nfs/Kconfig
> +++ b/fs/nfs/Kconfig
> @@ -82,6 +82,11 @@ config NFS_V4_1
>  
>  	  If unsure, say N.
>  
> +config PNFS_FILE_LAYOUT
> +	tristate
> +	depends on NFS_FS && NFS_V4_1
> +	default m

Should be 'default y', otherwise it has an implicit dependency on
CONFIG_MODULES.

> +
>  config ROOT_NFS
>  	bool "Root file system on NFS"
>  	depends on NFS_FS=y && IP_PNP
> diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
> index bb9e773..08a8889 100644
> --- a/fs/nfs/Makefile
> +++ b/fs/nfs/Makefile
> @@ -18,3 +18,6 @@ nfs-$(CONFIG_NFS_V4)	+= nfs4proc.o nfs4xdr.o nfs4state.o nfs4renewd.o \
>  nfs-$(CONFIG_NFS_V4_1)	+= pnfs.o
>  nfs-$(CONFIG_SYSCTL) += sysctl.o
>  nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-index.o
> +
> +obj-$(CONFIG_PNFS_FILE_LAYOUT) += nfs_layout_nfsv41_files.o
> +nfs_layout_nfsv41_files-y := nfs4filelayout.o
> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
> new file mode 100644
> index 0000000..c685196
> --- /dev/null
> +++ b/fs/nfs/nfs4filelayout.c
> @@ -0,0 +1,89 @@
> +/*
> + *  Module for the pnfs nfs4 file layout driver.
> + *  Defines all I/O and Policy interface operations, plus code
> + *  to register itself with the pNFS client.
> + *
> + *  Copyright (c) 2002
> + *  The Regents of the University of Michigan
> + *  All Rights Reserved
> + *
> + *  Dean Hildebrand <dhildebz@umich.edu>
> + *
> + *  Permission is granted to use, copy, create derivative works, and
> + *  redistribute this software and such derivative works for any purpose,
> + *  so long as the name of the University of Michigan is not used in
> + *  any advertising or publicity pertaining to the use or distribution
> + *  of this software without specific, written prior authorization. If
> + *  the above copyright notice or any other identification of the
> + *  University of Michigan is included in any copy of any portion of
> + *  this software, then the disclaimer below must also be included.
> + *
> + *  This software is provided as is, without representation or warranty
> + *  of any kind either express or implied, including without limitation
> + *  the implied warranties of merchantability, fitness for a particular
> + *  purpose, or noninfringement.  The Regents of the University of
> + *  Michigan shall not be liable for any damages, including special,
> + *  indirect, incidental, or consequential damages, with respect to any
> + *  claim arising out of or in connection with the use of the software,
> + *  even if it has been or is hereafter advised of the possibility of
> + *  such damages.
> + */
> +
> +#include <linux/nfs_fs.h>
> +#include "pnfs.h"
> +
> +#define NFSDBG_FACILITY         NFSDBG_PNFS_LD
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Dean Hildebrand <dhildebz@umich.edu>");
> +MODULE_DESCRIPTION("The NFSv4 file layout driver");
> +
> +int
> +filelayout_initialize_mountpoint(struct nfs_client *clp)
> +{
> +	return 0;
> +}
> +
> +int
> +filelayout_uninitialize_mountpoint(struct nfs_client *clp)
> +{
> +	dprintk("--> %s\n", __func__);
> +
> +	return 0;
> +}
> +
> +struct layoutdriver_io_operations filelayout_io_operations = {

Should definitely be declared as 'const' (and possibly 'static').

> +	.initialize_mountpoint   = filelayout_initialize_mountpoint,
> +	.uninitialize_mountpoint = filelayout_uninitialize_mountpoint,
> +};
> +
> +
> +struct pnfs_layoutdriver_type filelayout_type = {

Ditto.

> +	.id = LAYOUT_NFSV4_1_FILES,
> +	.name = "LAYOUT_NFSV4_1_FILES",
> +	.ld_io_ops = &filelayout_io_operations,

Why do we need a separate 'struct layoutdriver_io_operations'? Any
reason those can't just be embedded in struct pnfs_layoutdriver_type?
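
A standalone sketch of how the two suggestions above could combine, not the actual pnfs.h layout: the callbacks live directly in the driver type and the single instance is const. The struct and function names are hypothetical, and a void pointer stands in for struct nfs_client.

```c
#include <stddef.h>

/* Hypothetical flattened driver type: callbacks embedded, no separate ops struct. */
struct layoutdriver {
	unsigned int id;
	const char *name;
	int (*initialize_mountpoint)(void *clp);
	int (*uninitialize_mountpoint)(void *clp);
};

int files_init_mount(void *clp)
{
	(void)clp;
	return 0;
}

int files_uninit_mount(void *clp)
{
	(void)clp;
	return 0;
}

/* const: nothing in this table changes after build time */
const struct layoutdriver files_driver = {
	.id = 1,	/* LAYOUT4_NFSV4_1_FILES is 1 in RFC 5661 */
	.name = "LAYOUT_NFSV4_1_FILES",
	.initialize_mountpoint = files_init_mount,
	.uninitialize_mountpoint = files_uninit_mount,
};

/* Caller side: one pointer dereference fewer than ld->ld_io_ops->... */
int driver_mount(const struct layoutdriver *ld, void *clp)
{
	if (ld == NULL || ld->initialize_mountpoint == NULL)
		return -1;
	return ld->initialize_mountpoint(clp);
}
```

One caveat with this sketch: if the registration list node stays inside the same struct, as pnfs_tblid does in patch 07, the instance cannot be fully const because list_add() writes to it.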

> +};
> +
> +static int __init nfs4filelayout_init(void)
> +{
> +	printk(KERN_INFO "%s: NFSv4 File Layout Driver Registering...\n",
> +	       __func__);
> +
> +	/*
> +	 * Need to register file_operations struct with global list to indicate
> +	 * that NFS4 file layout is a possible pNFS I/O module
> +	 */
> +	return pnfs_register_layoutdriver(&filelayout_type);
> +}
> +
> +static void __exit nfs4filelayout_exit(void)
> +{
> +	printk(KERN_INFO "%s: NFSv4 File Layout Driver Unregistering...\n",
> +	       __func__);
> +
> +	/* Unregister NFS4 file layout driver with pNFS client*/
> +	pnfs_unregister_layoutdriver(&filelayout_type);
> +}
> +
> +module_init(nfs4filelayout_init);
> +module_exit(nfs4filelayout_exit);
> diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> index 042c2bd..a0f49a3 100644
> --- a/include/linux/nfs_fs.h
> +++ b/include/linux/nfs_fs.h
> @@ -614,6 +614,7 @@ extern void * nfs_root_data(void);
>  #define NFSDBG_MOUNT		0x0400
>  #define NFSDBG_FSCACHE		0x0800
>  #define NFSDBG_PNFS		0x1000
> +#define NFSDBG_PNFS_LD		0x2000
>  #define NFSDBG_ALL		0xFFFF
>  
>  #ifdef __KERNEL__




* Re: [PATCH 09/13] RFC: nfs: create and destroy inode's layout cache
  2010-09-02 18:00 ` [PATCH 09/13] RFC: nfs: create and destroy inode's layout cache Fred Isaman
@ 2010-09-10 19:43   ` Trond Myklebust
       [not found]     ` <1284147785.10062.80.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  2010-09-13 11:32     ` Benny Halevy
  0 siblings, 2 replies; 55+ messages in thread
From: Trond Myklebust @ 2010-09-10 19:43 UTC (permalink / raw)
  To: Fred Isaman; +Cc: linux-nfs

On Thu, 2010-09-02 at 14:00 -0400, Fred Isaman wrote:
> From: The pNFS Team <linux-nfs@vger.kernel.org>
> 
> At the start of the io paths, try to grab the relevant layout
> information.  This will initiate the inode's layout cache, but
> stubs ensure the cache stays empty.
> 
> Signed-off-by: TBD - melding/reorganization of several patches
> ---
>  fs/nfs/file.c          |    5 ++
>  fs/nfs/inode.c         |    3 +
>  fs/nfs/pnfs.c          |  140 ++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/nfs/pnfs.h          |   39 +++++++++++++
>  fs/nfs/read.c          |    3 +
>  include/linux/nfs_fs.h |    3 +
>  6 files changed, 193 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index eb51bd6..10ebdfb 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -36,6 +36,7 @@
>  #include "internal.h"
>  #include "iostat.h"
>  #include "fscache.h"
> +#include "pnfs.h"
>  
>  #define NFSDBG_FACILITY		NFSDBG_FILE
>  
> @@ -386,6 +387,10 @@ static int nfs_write_begin(struct file *file, struct address_space *mapping,
>  		file->f_path.dentry->d_name.name,
>  		mapping->host->i_ino, len, (long long) pos);
>  
> +	pnfs_update_layout(mapping->host,
> +			   nfs_file_open_context(file),
> +			   IOMODE_RW);
> +
>  start:
>  	/*
>  	 * Prevent starvation issues if someone is doing a consistency
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index 7d2d6c7..0dc6dad 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -48,6 +48,7 @@
>  #include "internal.h"
>  #include "fscache.h"
>  #include "dns_resolve.h"
> +#include "pnfs.h"
>  
>  #define NFSDBG_FACILITY		NFSDBG_VFS
>  
> @@ -1409,6 +1410,7 @@ void nfs4_evict_inode(struct inode *inode)
>  {
>  	truncate_inode_pages(&inode->i_data, 0);
>  	end_writeback(inode);
> +	pnfs_destroy_layout(NFS_I(inode));
>  	/* If we are holding a delegation, return it! */
>  	nfs_inode_return_delegation_noreclaim(inode);
>  	/* First call standard NFS clear_inode() code */
> @@ -1446,6 +1448,7 @@ static inline void nfs4_init_once(struct nfs_inode *nfsi)
>  	nfsi->delegation = NULL;
>  	nfsi->delegation_state = 0;
>  	init_rwsem(&nfsi->rwsem);
> +	nfsi->layout = NULL;
>  #endif
>  }
>  
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index 8d503fc..65f923b 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -151,3 +151,143 @@ pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *ld_type)
>  	spin_unlock(&pnfs_spinlock);
>  }
>  EXPORT_SYMBOL(pnfs_unregister_layoutdriver);
> +
> +static void
> +get_layout_hdr_locked(struct pnfs_layout_hdr *lo)
> +{
> +	assert_spin_locked(&lo->inode->i_lock);
> +	lo->refcount++;
> +}
> +
> +static void
> +put_layout_hdr_locked(struct pnfs_layout_hdr *lo)
> +{
> +	assert_spin_locked(&lo->inode->i_lock);
> +	BUG_ON(lo->refcount <= 0);
> +
> +	lo->refcount--;
> +	if (!lo->refcount) {
> +		dprintk("%s: freeing layout cache %p\n", __func__, lo);
> +		NFS_I(lo->inode)->layout = NULL;
> +		kfree(lo);
> +	}
> +}
> +
> +void
> +pnfs_destroy_layout(struct nfs_inode *nfsi)
> +{
> +	struct pnfs_layout_hdr *lo;
> +
> +	spin_lock(&nfsi->vfs_inode.i_lock);
> +	lo = nfsi->layout;
> +	if (lo) {
> +		/* Matched by refcount set to 1 in alloc_init_layout_hdr */
> +		put_layout_hdr_locked(lo);
> +	}
> +	spin_unlock(&nfsi->vfs_inode.i_lock);
> +}
> +
> +/* STUB - pretend LAYOUTGET to server failed */
> +static struct pnfs_layout_segment *
> +send_layoutget(struct pnfs_layout_hdr *lo,
> +	   struct nfs_open_context *ctx,
> +	   u32 iomode)
> +{
> +	struct inode *ino = lo->inode;
> +
> +	set_bit(lo_fail_bit(iomode), &lo->state);
> +	spin_lock(&ino->i_lock);
> +	put_layout_hdr_locked(lo);
> +	spin_unlock(&ino->i_lock);
> +	return NULL;
> +}
> +
> +static struct pnfs_layout_hdr *
> +alloc_init_layout_hdr(struct inode *ino)
> +{
> +	struct pnfs_layout_hdr *lo;
> +
> +	lo = kzalloc(sizeof(struct pnfs_layout_hdr), GFP_KERNEL);
> +	if (!lo)
> +		return NULL;
> +	lo->refcount = 1;
> +	lo->inode = ino;
> +	return lo;
> +}
> +
> +static struct pnfs_layout_hdr *
> +pnfs_find_alloc_layout(struct inode *ino)
> +{
> +	struct nfs_inode *nfsi = NFS_I(ino);
> +	struct pnfs_layout_hdr *new = NULL;
> +
> +	dprintk("%s Begin ino=%p layout=%p\n", __func__, ino, nfsi->layout);
> +
> +	assert_spin_locked(&ino->i_lock);
> +	if (nfsi->layout)
> +		return nfsi->layout;
> +
> +	spin_unlock(&ino->i_lock);
> +	new = alloc_init_layout_hdr(ino);
> +	spin_lock(&ino->i_lock);
> +
> +	if (likely(nfsi->layout == NULL))	/* Won the race? */
> +		nfsi->layout = new;
> +	else
> +		kfree(new);
> +	return nfsi->layout;
> +}
> +
> +/* STUB - LAYOUTGET never succeeds, so cache is empty */
> +static struct pnfs_layout_segment *
> +pnfs_has_layout(struct pnfs_layout_hdr *lo, u32 iomode)
> +{
> +	return NULL;
> +}
> +
> +/*
> + * Layout segment is retreived from the server if not cached.
> + * The appropriate layout segment is referenced and returned to the caller.
> + */
> +struct pnfs_layout_segment *
> +pnfs_update_layout(struct inode *ino,
> +		   struct nfs_open_context *ctx,
> +		   enum pnfs_iomode iomode)
> +{
> +	struct nfs_inode *nfsi = NFS_I(ino);
> +	struct pnfs_layout_hdr *lo;
> +	struct pnfs_layout_segment *lseg = NULL;
> +
> +	if (!pnfs_enabled_sb(NFS_SERVER(ino)))
> +		return NULL;
> +	spin_lock(&ino->i_lock);
> +	lo = pnfs_find_alloc_layout(ino);
> +	if (lo == NULL) {
> +		dprintk("%s ERROR: can't get pnfs_layout_hdr\n", __func__);
> +		goto out_unlock;
> +	}
> +
> +	/* Check to see if the layout for the given range already exists */
> +	lseg = pnfs_has_layout(lo, iomode);
> +	if (lseg) {
> +		dprintk("%s: Using cached lseg %p for iomode %d)\n",
> +			__func__, lseg, iomode);
> +		goto out_unlock;
> +	}
> +
> +	/* if LAYOUTGET already failed once we don't try again */
> +	if (test_bit(lo_fail_bit(iomode), &nfsi->layout->state))
> +		goto out_unlock;
> +
> +	get_layout_hdr_locked(lo);
> +	spin_unlock(&ino->i_lock);
> +
> +	lseg = send_layoutget(lo, ctx, iomode);
> +out:
> +	dprintk("%s end, state 0x%lx lseg %p\n", __func__,
> +		nfsi->layout->state, lseg);
> +	return lseg;
> +out_unlock:
> +	spin_unlock(&ino->i_lock);
> +	goto out;
> +}
> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> index 9049b9a..b63b445 100644
> --- a/fs/nfs/pnfs.h
> +++ b/fs/nfs/pnfs.h
> @@ -14,6 +14,11 @@
>  
>  #define LAYOUT_NFSV4_1_MODULE_PREFIX "nfs-layouttype4"
>  
> +enum {
> +	NFS_LAYOUT_RO_FAILED = 0,	/* get ro layout failed stop trying */
> +	NFS_LAYOUT_RW_FAILED,		/* get rw layout failed stop trying */
> +};
> +
>  /* Per-layout driver specific registration structure */
>  struct pnfs_layoutdriver_type {
>  	struct list_head pnfs_tblid;
> @@ -22,6 +27,12 @@ struct pnfs_layoutdriver_type {
>  	struct layoutdriver_io_operations *ld_io_ops;
>  };
>  
> +struct pnfs_layout_hdr {
> +	int			refcount;
        ^^^^^ Why not make this 'unsigned int' or 'unsigned long'?
> +	unsigned long		state;
> +	struct inode		*inode;
> +};
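
For the refcount type question above, a minimal standalone sketch of an unsigned counter follows. The demo_ names are invented, assert() stands in for BUG_ON(), and the i_lock that the real helpers require is left out.

```c
#include <assert.h>

/* Hypothetical cut-down header with an unsigned reference count. */
struct demo_layout_hdr {
	unsigned int refcount;
};

static void demo_get_layout_hdr(struct demo_layout_hdr *lo)
{
	lo->refcount++;
}

/* Returns 1 when the last reference was dropped, 0 otherwise. */
static int demo_put_layout_hdr(struct demo_layout_hdr *lo)
{
	/* With an unsigned type the "<= 0" check reduces to "== 0". */
	assert(lo->refcount != 0);
	lo->refcount--;
	return lo->refcount == 0;
}
```

A caller would free the header when demo_put_layout_hdr() returns 1, as put_layout_hdr_locked() does in the quoted patch.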
> +
>  /* Layout driver I/O operations. */
>  struct layoutdriver_io_operations {
>  	/* Registration information for a new mounted file system */
> @@ -32,11 +43,39 @@ struct layoutdriver_io_operations {
>  extern int pnfs_register_layoutdriver(struct pnfs_layoutdriver_type *);
>  extern void pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *);
>  
> +struct pnfs_layout_segment *
> +pnfs_update_layout(struct inode *ino, struct nfs_open_context *ctx,
> +		   enum pnfs_iomode access_type);
>  void set_pnfs_layoutdriver(struct nfs_server *, u32 id);
>  void unset_pnfs_layoutdriver(struct nfs_server *);
> +void pnfs_destroy_layout(struct nfs_inode *);
> +
> +
> +static inline int lo_fail_bit(u32 iomode)
> +{
> +	return iomode == IOMODE_RW ?
> +			 NFS_LAYOUT_RW_FAILED : NFS_LAYOUT_RO_FAILED;
> +}
> +
> +/* Return true if a layout driver is being used for this mountpoint */
> +static inline int pnfs_enabled_sb(struct nfs_server *nfss)
> +{
> +	return nfss->pnfs_curr_ld != NULL;
> +}
>  
>  #else  /* CONFIG_NFS_V4_1 */
>  
> +static inline void pnfs_destroy_layout(struct nfs_inode *nfsi)
> +{
> +}
> +
> +static inline struct pnfs_layout_segment *
> +pnfs_update_layout(struct inode *ino, struct nfs_open_context *ctx,
> +		   enum pnfs_iomode access_type)
> +{
> +	return NULL;
> +}
> +
>  static inline void set_pnfs_layoutdriver(struct nfs_server *s, u32 id)
>  {
>  }
> diff --git a/fs/nfs/read.c b/fs/nfs/read.c
> index 87adc27..f7eb66f 100644
> --- a/fs/nfs/read.c
> +++ b/fs/nfs/read.c
> @@ -25,6 +25,7 @@
>  #include "internal.h"
>  #include "iostat.h"
>  #include "fscache.h"
> +#include "pnfs.h"
>  
>  #define NFSDBG_FACILITY		NFSDBG_PAGECACHE
>  
> @@ -121,6 +122,7 @@ int nfs_readpage_async(struct nfs_open_context *ctx, struct inode *inode,
>  	len = nfs_page_length(page);
>  	if (len == 0)
>  		return nfs_return_empty_page(page);
> +	pnfs_update_layout(inode, ctx, IOMODE_READ);
>  	new = nfs_create_request(ctx, inode, page, 0, len);
>  	if (IS_ERR(new)) {
>  		unlock_page(page);
> @@ -625,6 +627,7 @@ int nfs_readpages(struct file *filp, struct address_space *mapping,
>  	if (ret == 0)
>  		goto read_complete; /* all pages were read */
>  
> +	pnfs_update_layout(inode, desc.ctx, IOMODE_READ);
>  	if (rsize < PAGE_CACHE_SIZE)
>  		nfs_pageio_init(&pgio, inode, nfs_pagein_multi, rsize, 0);
>  	else
> diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> index a0f49a3..ebd87a9 100644
> --- a/include/linux/nfs_fs.h
> +++ b/include/linux/nfs_fs.h
> @@ -188,6 +188,9 @@ struct nfs_inode {
>  	struct nfs_delegation	*delegation;
>  	fmode_t			 delegation_state;
>  	struct rw_semaphore	rwsem;
> +
> +	/* pNFS layout information */
> +	struct pnfs_layout_hdr *layout;
>  #endif /* CONFIG_NFS_V4*/
>  #ifdef CONFIG_NFS_FSCACHE
>  	struct fscache_cookie	*fscache;




* Re: [PATCH 10/13] RFC: nfs: client needs to maintain list of inodes with active layouts
  2010-09-02 18:00 ` [PATCH 10/13] RFC: nfs: client needs to maintain list of inodes with active layouts Fred Isaman
@ 2010-09-10 19:59   ` Trond Myklebust
       [not found]     ` <1284148768.10062.94.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 55+ messages in thread
From: Trond Myklebust @ 2010-09-10 19:59 UTC (permalink / raw)
  To: Fred Isaman; +Cc: linux-nfs

On Thu, 2010-09-02 at 14:00 -0400, Fred Isaman wrote:
> From: The pNFS Team <linux-nfs@vger.kernel.org>

> +static inline struct nfs_server *
> +PNFS_NFS_SERVER(struct pnfs_layout_hdr *lo)
> +{
> +	return NFS_SERVER(lo->inode);
> +}
> +

Why do we need this?


* Re: [PATCH 12/13] RFC: pnfs: add LAYOUTGET and GETDEVICEINFO infrastructure
  2010-09-02 18:00 ` [PATCH 12/13] RFC: pnfs: add LAYOUTGET and GETDEVICEINFO infrastructure Fred Isaman
@ 2010-09-10 20:11   ` Trond Myklebust
  2010-09-10 21:47     ` Fred Isaman
  0 siblings, 1 reply; 55+ messages in thread
From: Trond Myklebust @ 2010-09-10 20:11 UTC (permalink / raw)
  To: Fred Isaman; +Cc: linux-nfs

On Thu, 2010-09-02 at 14:00 -0400, Fred Isaman wrote:
> From: The pNFS Team <linux-nfs@vger.kernel.org>
> 
> Add the ability to actually send LAYOUTGET and GETDEVICEINFO.  This also adds
> in the machinery to handle layout state and the deviceid cache.  Note that
> GETDEVICEINFO is not called directly by the generic layer.  Instead it
> is called by the drivers while parsing the LAYOUTGET opaque data in response
> to an unknown device id embedded therein.  Annoyingly, RFC 5661 only encodes
> device ids within the driver-specific opaque data.
> 
> Signed-off-by: TBD - melding/reorganization of several patches
> ---
>  fs/nfs/nfs4proc.c         |  134 ++++++++++++++++
>  fs/nfs/nfs4xdr.c          |  302 +++++++++++++++++++++++++++++++++++
>  fs/nfs/pnfs.c             |  382 ++++++++++++++++++++++++++++++++++++++++++---
>  fs/nfs/pnfs.h             |   91 +++++++++++-
>  include/linux/nfs4.h      |    2 +
>  include/linux/nfs_fs_sb.h |    1 +
>  include/linux/nfs_xdr.h   |   49 ++++++
>  7 files changed, 935 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index c7c7277..7eeea0e 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -55,6 +55,7 @@
>  #include "internal.h"
>  #include "iostat.h"
>  #include "callback.h"
> +#include "pnfs.h"
>  
>  #define NFSDBG_FACILITY		NFSDBG_PROC
>  
> @@ -5335,6 +5336,139 @@ out:
>  	dprintk("<-- %s status=%d\n", __func__, status);
>  	return status;
>  }
> +
> +static void
> +nfs4_layoutget_prepare(struct rpc_task *task, void *calldata)
> +{
> +	struct nfs4_layoutget *lgp = calldata;
> +	struct inode *ino = lgp->args.inode;
> +	struct nfs_server *server = NFS_SERVER(ino);
> +
> +	dprintk("--> %s\n", __func__);
> +	if (nfs4_setup_sequence(server, &lgp->args.seq_args,
> +				&lgp->res.seq_res, 0, task))
> +		return;
> +	rpc_call_start(task);
> +}
> +
> +static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
> +{
> +	struct nfs4_layoutget *lgp = calldata;
> +	struct inode *ino = lgp->args.inode;
> +	struct nfs_server *server = NFS_SERVER(ino);
> +
> +	dprintk("--> %s\n", __func__);
> +
> +	if (!nfs4_sequence_done(task, &lgp->res.seq_res))
> +		return;
> +
> +	if (RPC_ASSASSINATED(task))
> +		return;
> +
> +	if (nfs4_async_handle_error(task, server, NULL) == -EAGAIN)
> +		nfs_restart_rpc(task, server->nfs_client);
> +
> +	lgp->status = task->tk_status;
> +	dprintk("<-- %s\n", __func__);
> +}
> +
> +static void nfs4_layoutget_release(void *calldata)
> +{
> +	struct nfs4_layoutget *lgp = calldata;
> +
> +	dprintk("--> %s\n", __func__);
> +	put_layout_hdr(lgp->args.inode);
> +	if (lgp->res.layout.buf != NULL)
> +		free_page((unsigned long) lgp->res.layout.buf);
> +	put_nfs_open_context(lgp->args.ctx);
> +	kfree(calldata);
> +	dprintk("<-- %s\n", __func__);
> +}
> +
> +static const struct rpc_call_ops nfs4_layoutget_call_ops = {
> +	.rpc_call_prepare = nfs4_layoutget_prepare,
> +	.rpc_call_done = nfs4_layoutget_done,
> +	.rpc_release = nfs4_layoutget_release,
> +};
> +
> +static int _nfs4_proc_layoutget(struct nfs4_layoutget *lgp)
> +{
> +	struct nfs_server *server = NFS_SERVER(lgp->args.inode);
> +	struct rpc_task *task;
> +	struct rpc_message msg = {
> +		.rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_LAYOUTGET],
> +		.rpc_argp = &lgp->args,
> +		.rpc_resp = &lgp->res,
> +	};
> +	struct rpc_task_setup task_setup_data = {
> +		.rpc_client = server->client,
> +		.rpc_message = &msg,
> +		.callback_ops = &nfs4_layoutget_call_ops,
> +		.callback_data = lgp,
> +		.flags = RPC_TASK_ASYNC,
> +	};
> +	int status = 0;
> +
> +	dprintk("--> %s\n", __func__);
> +
> +	lgp->res.layout.buf = (void *)__get_free_page(GFP_NOFS);
> +	if (lgp->res.layout.buf == NULL) {
> +		nfs4_layoutget_release(lgp);
> +		return -ENOMEM;
> +	}
> +
> +	lgp->res.seq_res.sr_slotid = NFS4_MAX_SLOT_TABLE;
> +	task = rpc_run_task(&task_setup_data);
> +	if (IS_ERR(task))
> +		return PTR_ERR(task);
> +	status = nfs4_wait_for_completion_rpc_task(task);
> +	if (status != 0)
> +		goto out;
> +	status = lgp->status;
> +	if (status != 0)
> +		goto out;
> +	status = pnfs_layout_process(lgp);
> +out:
> +	rpc_put_task(task);
> +	dprintk("<-- %s status=%d\n", __func__, status);
> +	return status;
> +}
> +
> +int nfs4_proc_layoutget(struct nfs4_layoutget *lgp)
> +{
> +	struct nfs_server *server = NFS_SERVER(lgp->args.inode);
> +	struct nfs4_exception exception = { };
> +	int err;
> +	do {
> +		err = nfs4_handle_exception(server, _nfs4_proc_layoutget(lgp),
> +					    &exception);
> +	} while (exception.retry);
> +	return err;
> +}

Since nfs4_layoutget_done() already calls nfs4_async_handle_error(), do
you really need to call nfs4_handle_exception()?

> +
> +int nfs4_proc_getdeviceinfo(struct nfs_server *server, struct pnfs_device *pdev)
> +{
> +	struct nfs4_getdeviceinfo_args args = {
> +		.pdev = pdev,
> +	};
> +	struct nfs4_getdeviceinfo_res res = {
> +		.pdev = pdev,
> +	};
> +	struct rpc_message msg = {
> +		.rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_GETDEVICEINFO],
> +		.rpc_argp = &args,
> +		.rpc_resp = &res,
> +	};
> +	int status;
> +
> +	dprintk("--> %s\n", __func__);
> +	status = nfs4_call_sync(server, &msg, &args, &res, 0);
> +	dprintk("<-- %s status=%d\n", __func__, status);
> +
> +	return status;
> +}
> +EXPORT_SYMBOL_GPL(nfs4_proc_getdeviceinfo);
> +

This, on the other hand, might need a 'handle exception' wrapper.
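
The shape of such a wrapper, modelled on the do/while loop quoted above for LAYOUTGET, could be sketched in userspace as follows. Everything prefixed demo_ or DEMO_ is a stand-in: the real nfs4_handle_exception() takes a server argument, handles many more error codes, and waits before retrying, none of which is reproduced here.

```c
#include <stddef.h>

/* NFS4ERR_DELAY is 10008 on the wire; negated as a return code here. */
#define DEMO_ERR_DELAY (-10008)

struct demo_exception {
	int retry;
};

/*
 * Stand-in for nfs4_handle_exception(): request a retry on a transient
 * error and pass every other result through unchanged.
 */
static int demo_handle_exception(int err, struct demo_exception *exc)
{
	exc->retry = 0;
	if (err == DEMO_ERR_DELAY) {
		exc->retry = 1;
		return 0;
	}
	return err;
}

static int demo_calls;			/* number of raw calls issued */
static int demo_failures_left = 2;	/* fake server: fail twice, then succeed */

/* Stand-in for the raw synchronous GETDEVICEINFO call. */
static int demo_raw_getdeviceinfo(void)
{
	demo_calls++;
	if (demo_failures_left > 0) {
		demo_failures_left--;
		return DEMO_ERR_DELAY;
	}
	return 0;
}

/* The wrapper: repeat the raw call for as long as a retry is requested. */
static int demo_proc_getdeviceinfo(void)
{
	struct demo_exception exception = { 0 };
	int err;

	do {
		err = demo_handle_exception(demo_raw_getdeviceinfo(),
					    &exception);
	} while (exception.retry);
	return err;
}
```

In this sketch the fake server rejects the first two attempts, so the wrapper issues three raw calls before returning 0.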

>  #endif /* CONFIG_NFS_V4_1 */
>  
>  struct nfs4_state_recovery_ops nfs40_reboot_recovery_ops = {
> diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
> index 60233ae..aaf6fe5 100644
> --- a/fs/nfs/nfs4xdr.c
> +++ b/fs/nfs/nfs4xdr.c
> @@ -52,6 +52,7 @@
>  #include <linux/nfs_idmap.h>
>  #include "nfs4_fs.h"
>  #include "internal.h"
> +#include "pnfs.h"
>  
>  #define NFSDBG_FACILITY		NFSDBG_XDR
>  
> @@ -310,6 +311,19 @@ static int nfs4_stat_to_errno(int);
>  				XDR_QUADLEN(NFS4_MAX_SESSIONID_LEN) + 5)
>  #define encode_reclaim_complete_maxsz	(op_encode_hdr_maxsz + 4)
>  #define decode_reclaim_complete_maxsz	(op_decode_hdr_maxsz + 4)
> +#define encode_getdeviceinfo_maxsz (op_encode_hdr_maxsz + 4 + \
> +				XDR_QUADLEN(NFS4_PNFS_DEVICEID4_SIZE))
> +#define decode_getdeviceinfo_maxsz (op_decode_hdr_maxsz + \
> +				1 /* layout type */ + \
> +				1 /* opaque devaddr4 length */ + \
> +				  /* devaddr4 payload is read into page */ \
> +				1 /* notification bitmap length */ + \
> +				1 /* notification bitmap */)
> +#define encode_layoutget_maxsz	(op_encode_hdr_maxsz + 10 + \
> +				encode_stateid_maxsz)
> +#define decode_layoutget_maxsz	(op_decode_hdr_maxsz + 8 + \
> +				decode_stateid_maxsz + \
> +				XDR_QUADLEN(PNFS_LAYOUT_MAXSIZE))
>  #else /* CONFIG_NFS_V4_1 */
>  #define encode_sequence_maxsz	0
>  #define decode_sequence_maxsz	0
> @@ -699,6 +713,20 @@ static int nfs4_stat_to_errno(int);
>  #define NFS4_dec_reclaim_complete_sz	(compound_decode_hdr_maxsz + \
>  					 decode_sequence_maxsz + \
>  					 decode_reclaim_complete_maxsz)
> +#define NFS4_enc_getdeviceinfo_sz (compound_encode_hdr_maxsz +    \
> +				encode_sequence_maxsz +\
> +				encode_getdeviceinfo_maxsz)
> +#define NFS4_dec_getdeviceinfo_sz (compound_decode_hdr_maxsz +    \
> +				decode_sequence_maxsz + \
> +				decode_getdeviceinfo_maxsz)
> +#define NFS4_enc_layoutget_sz	(compound_encode_hdr_maxsz + \
> +				encode_sequence_maxsz + \
> +				encode_putfh_maxsz +        \
> +				encode_layoutget_maxsz)
> +#define NFS4_dec_layoutget_sz	(compound_decode_hdr_maxsz + \
> +				decode_sequence_maxsz + \
> +				decode_putfh_maxsz +        \
> +				decode_layoutget_maxsz)
>  
>  const u32 nfs41_maxwrite_overhead = ((RPC_MAX_HEADER_WITH_AUTH +
>  				      compound_encode_hdr_maxsz +
> @@ -1726,6 +1754,61 @@ static void encode_sequence(struct xdr_stream *xdr,
>  #endif /* CONFIG_NFS_V4_1 */
>  }
>  
> +#ifdef CONFIG_NFS_V4_1
> +static void
> +encode_getdeviceinfo(struct xdr_stream *xdr,
> +		     const struct nfs4_getdeviceinfo_args *args,
> +		     struct compound_hdr *hdr)
> +{
> +	int has_bitmap = (args->pdev->dev_notify_types != 0);
> +	int len = 16 + NFS4_PNFS_DEVICEID4_SIZE + (has_bitmap * 4);
> +	__be32 *p;
> +
> +	p = reserve_space(xdr, len);
> +	*p++ = cpu_to_be32(OP_GETDEVICEINFO);
> +	p = xdr_encode_opaque_fixed(p, args->pdev->dev_id.data,
> +				    NFS4_PNFS_DEVICEID4_SIZE);
> +	*p++ = cpu_to_be32(args->pdev->layout_type);
> +	*p++ = cpu_to_be32(args->pdev->pglen);		/* gdia_maxcount */
> +	*p++ = cpu_to_be32(has_bitmap);			/* bitmap length [01] */
> +	if (has_bitmap)
> +		*p = cpu_to_be32(args->pdev->dev_notify_types);

We don't support notification callbacks yet.

> +	hdr->nops++;
> +	hdr->replen += decode_getdeviceinfo_maxsz;
> +}
> +
> +static void
> +encode_layoutget(struct xdr_stream *xdr,
> +		      const struct nfs4_layoutget_args *args,
> +		      struct compound_hdr *hdr)
> +{
> +	nfs4_stateid stateid;
> +	__be32 *p;
> +
> +	p = reserve_space(xdr, 44 + NFS4_STATEID_SIZE);
> +	*p++ = cpu_to_be32(OP_LAYOUTGET);
> +	*p++ = cpu_to_be32(0);     /* Signal layout available */
> +	*p++ = cpu_to_be32(args->type);
> +	*p++ = cpu_to_be32(args->range.iomode);
> +	p = xdr_encode_hyper(p, args->range.offset);
> +	p = xdr_encode_hyper(p, args->range.length);
> +	p = xdr_encode_hyper(p, args->minlength);
> +	pnfs_get_layout_stateid(&stateid, NFS_I(args->inode)->layout);
> +	p = xdr_encode_opaque_fixed(p, &stateid.data, NFS4_STATEID_SIZE);
> +	*p = cpu_to_be32(args->maxcount);
> +
> +	dprintk("%s: 1st type:0x%x iomode:%d off:%lu len:%lu mc:%d\n",
> +		__func__,
> +		args->type,
> +		args->range.iomode,
> +		(unsigned long)args->range.offset,
> +		(unsigned long)args->range.length,
> +		args->maxcount);
> +	hdr->nops++;
> +	hdr->replen += decode_layoutget_maxsz;
> +}
> +#endif /* CONFIG_NFS_V4_1 */
> +
>  /*
>   * END OF "GENERIC" ENCODE ROUTINES.
>   */
> @@ -2543,6 +2626,51 @@ static int nfs4_xdr_enc_reclaim_complete(struct rpc_rqst *req, uint32_t *p,
>  	return 0;
>  }
>  
> +/*
> + * Encode GETDEVICEINFO request
> + */
> +static int nfs4_xdr_enc_getdeviceinfo(struct rpc_rqst *req, uint32_t *p,
> +				      struct nfs4_getdeviceinfo_args *args)
> +{
> +	struct xdr_stream xdr;
> +	struct compound_hdr hdr = {
> +		.minorversion = nfs4_xdr_minorversion(&args->seq_args),
> +	};
> +
> +	xdr_init_encode(&xdr, &req->rq_snd_buf, p);
> +	encode_compound_hdr(&xdr, req, &hdr);
> +	encode_sequence(&xdr, &args->seq_args, &hdr);
> +	encode_getdeviceinfo(&xdr, args, &hdr);
> +
> +	/* set up reply kvec. Subtract notification bitmap max size (2)
> +	 * so that notification bitmap is put in xdr_buf tail */
> +	xdr_inline_pages(&req->rq_rcv_buf, (hdr.replen - 2) << 2,
> +			 args->pdev->pages, args->pdev->pgbase,
> +			 args->pdev->pglen);
> +
> +	encode_nops(&hdr);
> +	return 0;
> +}
> +
> +/*
> + *  Encode LAYOUTGET request
> + */
> +static int nfs4_xdr_enc_layoutget(struct rpc_rqst *req, uint32_t *p,
> +				  struct nfs4_layoutget_args *args)
> +{
> +	struct xdr_stream xdr;
> +	struct compound_hdr hdr = {
> +		.minorversion = nfs4_xdr_minorversion(&args->seq_args),
> +	};
> +
> +	xdr_init_encode(&xdr, &req->rq_snd_buf, p);
> +	encode_compound_hdr(&xdr, req, &hdr);
> +	encode_sequence(&xdr, &args->seq_args, &hdr);
> +	encode_putfh(&xdr, NFS_FH(args->inode), &hdr);
> +	encode_layoutget(&xdr, args, &hdr);
> +	encode_nops(&hdr);
> +	return 0;
> +}
>  #endif /* CONFIG_NFS_V4_1 */
>  
>  static void print_overflow_msg(const char *func, const struct xdr_stream *xdr)
> @@ -4788,6 +4916,131 @@ out_overflow:
>  #endif /* CONFIG_NFS_V4_1 */
>  }
>  
> +#if defined(CONFIG_NFS_V4_1)
> +
> +static int decode_getdeviceinfo(struct xdr_stream *xdr,
> +				struct pnfs_device *pdev)
> +{
> +	__be32 *p;
> +	uint32_t len, type;
> +	int status;
> +
> +	status = decode_op_hdr(xdr, OP_GETDEVICEINFO);
> +	if (status) {
> +		if (status == -ETOOSMALL) {
> +			p = xdr_inline_decode(xdr, 4);
> +			if (unlikely(!p))
> +				goto out_overflow;
> +			pdev->mincount = be32_to_cpup(p);
> +			dprintk("%s: Min count too small. mincnt = %u\n",
> +				__func__, pdev->mincount);
> +		}
> +		return status;
> +	}
> +
> +	p = xdr_inline_decode(xdr, 8);
> +	if (unlikely(!p))
> +		goto out_overflow;
> +	type = be32_to_cpup(p++);
> +	if (type != pdev->layout_type) {
> +		dprintk("%s: layout mismatch req: %u pdev: %u\n",
> +			__func__, pdev->layout_type, type);
> +		return -EINVAL;
> +	}
> +	/*
> +	 * Get the length of the opaque device_addr4. xdr_read_pages places
> +	 * the opaque device_addr4 in the xdr_buf->pages (pnfs_device->pages)
> +	 * and places the remaining xdr data in xdr_buf->tail
> +	 */
> +	pdev->mincount = be32_to_cpup(p);
> +	xdr_read_pages(xdr, pdev->mincount); /* include space for the length */
> +
> +	/*
> +	 * At most one bitmap word. If the server returns a bitmap of more
> +	 * than one word we ignore the extra invalid words given that
> +	 * getdeviceinfo is the final operation in the compound.
> +	 */
> +	p = xdr_inline_decode(xdr, 4);
> +	if (unlikely(!p))
> +		goto out_overflow;
> +	len = be32_to_cpup(p);
> +	if (len) {
> +		p = xdr_inline_decode(xdr, 4);
> +		if (unlikely(!p))
> +			goto out_overflow;
> +		pdev->dev_notify_types = be32_to_cpup(p);
> +	} else
> +		pdev->dev_notify_types = 0;

Again, we don't support notifications.

> +	return 0;
> +out_overflow:
> +	print_overflow_msg(__func__, xdr);
> +	return -EIO;
> +}
> +
> +static int decode_layoutget(struct xdr_stream *xdr, struct rpc_rqst *req,
> +			    struct nfs4_layoutget_res *res)
> +{
> +	__be32 *p;
> +	int status;
> +	u32 layout_count;
> +
> +	status = decode_op_hdr(xdr, OP_LAYOUTGET);
> +	if (status)
> +		return status;
> +	p = xdr_inline_decode(xdr, 8 + NFS4_STATEID_SIZE);
> +	if (unlikely(!p))
> +		goto out_overflow;
> +	res->return_on_close = be32_to_cpup(p++);
> +	p = xdr_decode_opaque_fixed(p, res->stateid.data, NFS4_STATEID_SIZE);
> +	layout_count = be32_to_cpup(p);
> +	if (!layout_count) {
> +		dprintk("%s: server responded with empty layout array\n",
> +			__func__);
> +		return -EINVAL;
> +	}
> +
> +	p = xdr_inline_decode(xdr, 24);
> +	if (unlikely(!p))
> +		goto out_overflow;
> +	p = xdr_decode_hyper(p, &res->range.offset);
> +	p = xdr_decode_hyper(p, &res->range.length);
> +	res->range.iomode = be32_to_cpup(p++);
> +	res->type = be32_to_cpup(p++);
> +
> +	status = decode_opaque_inline(xdr, &res->layout.len, (char **)&p);
> +	if (unlikely(status))
> +		return status;
> +
> +	dprintk("%s roff:%lu rlen:%lu riomode:%d, lo_type:0x%x, lo.len:%d\n",
> +		__func__,
> +		(unsigned long)res->range.offset,
> +		(unsigned long)res->range.length,
> +		res->range.iomode,
> +		res->type,
> +		res->layout.len);
> +
> +	/* nfs4_proc_layoutget allocated a single page */
> +	if (res->layout.len > PAGE_SIZE)
> +		return -ENOMEM;
> +	memcpy(res->layout.buf, p, res->layout.len);
> +
> +	if (layout_count > 1) {
> +		/* We only handle a length one array at the moment.  Any
> +		 * further entries are just ignored.  Note that this means
> +		 * the client may see a response that is less than the
> +		 * minimum it requested.
> +		 */
> +		dprintk("%s: server responded with %d layouts, dropping tail\n",
> +			__func__, layout_count);
> +	}
> +
> +	return 0;
> +out_overflow:
> +	print_overflow_msg(__func__, xdr);
> +	return -EIO;
> +}
> +#endif /* CONFIG_NFS_V4_1 */
> +
>  /*
>   * END OF "GENERIC" DECODE ROUTINES.
>   */
> @@ -5815,6 +6068,53 @@ static int nfs4_xdr_dec_reclaim_complete(struct rpc_rqst *rqstp, uint32_t *p,
>  		status = decode_reclaim_complete(&xdr, (void *)NULL);
>  	return status;
>  }
> +
> +/*
> + * Decode GETDEVINFO response
> + */
> +static int nfs4_xdr_dec_getdeviceinfo(struct rpc_rqst *rqstp, uint32_t *p,
> +				      struct nfs4_getdeviceinfo_res *res)
> +{
> +	struct xdr_stream xdr;
> +	struct compound_hdr hdr;
> +	int status;
> +
> +	xdr_init_decode(&xdr, &rqstp->rq_rcv_buf, p);
> +	status = decode_compound_hdr(&xdr, &hdr);
> +	if (status != 0)
> +		goto out;
> +	status = decode_sequence(&xdr, &res->seq_res, rqstp);
> +	if (status != 0)
> +		goto out;
> +	status = decode_getdeviceinfo(&xdr, res->pdev);
> +out:
> +	return status;
> +}
> +
> +/*
> + * Decode LAYOUTGET response
> + */
> +static int nfs4_xdr_dec_layoutget(struct rpc_rqst *rqstp, uint32_t *p,
> +				  struct nfs4_layoutget_res *res)
> +{
> +	struct xdr_stream xdr;
> +	struct compound_hdr hdr;
> +	int status;
> +
> +	xdr_init_decode(&xdr, &rqstp->rq_rcv_buf, p);
> +	status = decode_compound_hdr(&xdr, &hdr);
> +	if (status)
> +		goto out;
> +	status = decode_sequence(&xdr, &res->seq_res, rqstp);
> +	if (status)
> +		goto out;
> +	status = decode_putfh(&xdr);
> +	if (status)
> +		goto out;
> +	status = decode_layoutget(&xdr, rqstp, res);
> +out:
> +	return status;
> +}
>  #endif /* CONFIG_NFS_V4_1 */
>  
>  __be32 *nfs4_decode_dirent(__be32 *p, struct nfs_entry *entry, int plus)
> @@ -5993,6 +6293,8 @@ struct rpc_procinfo	nfs4_procedures[] = {
>    PROC(SEQUENCE,	enc_sequence,	dec_sequence),
>    PROC(GET_LEASE_TIME,	enc_get_lease_time,	dec_get_lease_time),
>    PROC(RECLAIM_COMPLETE, enc_reclaim_complete,  dec_reclaim_complete),
> +  PROC(GETDEVICEINFO, enc_getdeviceinfo, dec_getdeviceinfo),
> +  PROC(LAYOUTGET,  enc_layoutget,     dec_layoutget),
>  #endif /* CONFIG_NFS_V4_1 */
>  };
>  
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index cbce942..faf6c4c 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -128,6 +128,12 @@ pnfs_register_layoutdriver(struct pnfs_layoutdriver_type *ld_type)
>  		return status;
>  	}
>  
> +	if (!io_ops->alloc_lseg || !io_ops->free_lseg) {
> +		printk(KERN_ERR "%s Layout driver must provide "
> +		       "alloc_lseg and free_lseg.\n", __func__);
> +		return status;
> +	}
> +
>  	spin_lock(&pnfs_spinlock);
>  	if (!find_pnfs_driver_locked(ld_type->id)) {
>  		list_add(&ld_type->pnfs_tblid, &pnfs_modules_tbl);
> @@ -153,6 +159,10 @@ pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *ld_type)
>  }
>  EXPORT_SYMBOL(pnfs_unregister_layoutdriver);
>  
> +/*
> + * pNFS client layout cache
> + */
> +
>  static void
>  get_layout_hdr_locked(struct pnfs_layout_hdr *lo)
>  {
> @@ -175,6 +185,15 @@ put_layout_hdr_locked(struct pnfs_layout_hdr *lo)
>  	}
>  }
>  
> +void
> +put_layout_hdr(struct inode *inode)
> +{
> +	spin_lock(&inode->i_lock);
> +	put_layout_hdr_locked(NFS_I(inode)->layout);
> +	spin_unlock(&inode->i_lock);
> +
> +}
> +
>  static void
>  init_lseg(struct pnfs_layout_hdr *lo, struct pnfs_layout_segment *lseg)
>  {
> @@ -191,7 +210,7 @@ destroy_lseg(struct kref *kref)
>  	struct pnfs_layout_hdr *local = lseg->layout;
>  
>  	dprintk("--> %s\n", __func__);
> -	kfree(lseg);
> +	PNFS_LD_IO_OPS(local)->free_lseg(lseg);

Where is PNFS_LD_IO_OPS() defined? Besides, I thought we agreed to get
rid of that.

>  	/* Matched by get_layout_hdr_locked in pnfs_insert_layout */
>  	put_layout_hdr_locked(local);
>  }
> @@ -226,6 +245,7 @@ pnfs_clear_lseg_list(struct pnfs_layout_hdr *lo)
>  	/* List does not take a reference, so no need for put here */
>  	list_del_init(&lo->layouts);
>  	spin_unlock(&clp->cl_lock);
> +	pnfs_set_layout_stateid(lo, &zero_stateid);
>  
>  	dprintk("%s:Return\n", __func__);
>  }
> @@ -268,40 +288,120 @@ pnfs_destroy_all_layouts(struct nfs_client *clp)
>  	}
>  }
>  
> -static void pnfs_insert_layout(struct pnfs_layout_hdr *lo,
> -			       struct pnfs_layout_segment *lseg);
> +void
> +pnfs_set_layout_stateid(struct pnfs_layout_hdr *lo,
> +			const nfs4_stateid *stateid)
> +{
> +	write_seqlock(&lo->seqlock);
> +	memcpy(lo->stateid.data, stateid->data, sizeof(lo->stateid.data));
> +	write_sequnlock(&lo->seqlock);
> +}
> +
> +void
> +pnfs_get_layout_stateid(nfs4_stateid *dst, struct pnfs_layout_hdr *lo)
> +{
> +	int seq;
>  
> -/* Get layout from server. */
> +	dprintk("--> %s\n", __func__);
> +
> +	do {
> +		seq = read_seqbegin(&lo->seqlock);
> +		memcpy(dst->data, lo->stateid.data,
> +		       sizeof(lo->stateid.data));
> +	} while (read_seqretry(&lo->seqlock, seq));
> +
> +	dprintk("<-- %s\n", __func__);
> +}
> +
> +static void
> +pnfs_layout_from_open_stateid(struct pnfs_layout_hdr *lo,
> +			      struct nfs4_state *state)
> +{
> +	int seq;
> +
> +	dprintk("--> %s\n", __func__);
> +
> +	write_seqlock(&lo->seqlock);
> +	/* Zero stateid, which is illegal to use in layout, is our
> +	 * marker for an un-initialized stateid.
> +	 */

Isn't it easier just to have a flag in the layout?
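
A minimal user-space sketch of the flag approach, for illustration only. The names `layout_hdr_sketch`, `LO_STATEID_SET` and `layout_seed_stateid` are hypothetical and not part of the patch, and the seqlock handling is left out. A bit in the layout's `state` word replaces the comparison against the zero stateid, and the return value tells the caller whether anything was copied.

```c
#include <stdbool.h>
#include <string.h>

#define STATEID_SIZE 16	/* same value as NFS4_STATEID_SIZE in the patch */

/* Hypothetical, simplified stand-in for struct pnfs_layout_hdr. */
struct layout_hdr_sketch {
	unsigned long state;			/* bit flags */
	unsigned char stateid[STATEID_SIZE];	/* layout stateid bytes */
};

enum { LO_STATEID_SET = 0 };	/* hypothetical flag bit number */

/*
 * Seed the layout stateid from the open stateid, but only if the layout
 * stateid has never been set.  Returns true if the copy happened and
 * false if the layout already carried a stateid.
 */
static bool layout_seed_stateid(struct layout_hdr_sketch *lo,
				const unsigned char *open_stateid)
{
	if (lo->state & (1UL << LO_STATEID_SET))
		return false;
	memcpy(lo->stateid, open_stateid, STATEID_SIZE);
	lo->state |= 1UL << LO_STATEID_SET;
	return true;
}
```

In the kernel the flag would presumably be tested and set with test_bit()/set_bit() on lo->state under the same lock that protects the stateid; that part is an assumption and is not shown here.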

> +	if (!memcmp(lo->stateid.data, &zero_stateid, NFS4_STATEID_SIZE))
> +		do {
> +			seq = read_seqbegin(&state->seqlock);
> +			memcpy(lo->stateid.data, state->stateid.data,
> +					sizeof(state->stateid.data));
> +		} while (read_seqretry(&state->seqlock, seq));
> +	write_sequnlock(&lo->seqlock);

...and if memcmp() returns non-zero (i.e. the layout stateid has already
been set), is the caller supposed to detect that nothing was done?

> +	dprintk("<-- %s\n", __func__);
> +}
> +
> +/*
> +* Get layout from server.
> +*    for now, assume that whole file layouts are requested.
> +*    arg->offset: 0
> +*    arg->length: all ones
> +*/
>  static struct pnfs_layout_segment *
>  send_layoutget(struct pnfs_layout_hdr *lo,
>  	   struct nfs_open_context *ctx,
>  	   u32 iomode)
>  {
>  	struct inode *ino = lo->inode;
> -	struct pnfs_layout_segment *lseg;
> +	struct nfs_server *server = NFS_SERVER(ino);
> +	struct nfs4_layoutget *lgp;
> +	struct pnfs_layout_segment *lseg = NULL;
>  
> -	/* Lets pretend we sent LAYOUTGET and got a response */
> -	lseg = kzalloc(sizeof(*lseg), GFP_KERNEL);
> +	dprintk("--> %s\n", __func__);
> +
> +	BUG_ON(ctx == NULL);
> +	lgp = kzalloc(sizeof(*lgp), GFP_KERNEL);
> +	if (lgp == NULL) {
> +		put_layout_hdr(lo->inode);
> +		return NULL;
> +	}
> +	lgp->args.minlength = NFS4_MAX_UINT64;
> +	lgp->args.maxcount = PNFS_LAYOUT_MAXSIZE;
> +	lgp->args.range.iomode = iomode;
> +	lgp->args.range.offset = 0;
> +	lgp->args.range.length = NFS4_MAX_UINT64;
> +	lgp->args.type = server->pnfs_curr_ld->id;
> +	lgp->args.inode = ino;
> +	lgp->args.ctx = get_nfs_open_context(ctx);
> +	lgp->lsegpp = &lseg;
> +
> +	if (!memcmp(lo->stateid.data, &zero_stateid, NFS4_STATEID_SIZE))
> +		pnfs_layout_from_open_stateid(NFS_I(ino)->layout, ctx->state);

Why do an extra memcmp() here?

> +
> +	/* Synchronously retrieve layout information from server and
> +	 * store in lseg.
> +	 */
> +	nfs4_proc_layoutget(lgp);
>  	if (!lseg) {
> +		/* remember that LAYOUTGET failed and suspend trying */
>  		set_bit(lo_fail_bit(iomode), &lo->state);
> -		spin_lock(&ino->i_lock);
> -		put_layout_hdr_locked(lo);
> -		spin_unlock(&ino->i_lock);
> -		return NULL;
>  	}
> -	init_lseg(lo, lseg);
> -	lseg->iomode = IOMODE_RW;
> -	spin_lock(&ino->i_lock);
> -	pnfs_insert_layout(lo, lseg);
> -	put_layout_hdr_locked(lo);
> -	spin_unlock(&ino->i_lock);
>  	return lseg;
>  }
>  
> +/*
> + * Compare two layout segments for sorting into layout cache.
> + * We want to preferentially return RW over RO layouts, so ensure those
> + * are seen first.
> + */
> +static s64
> +cmp_layout(u32 iomode1, u32 iomode2)
> +{
> +	/* read > read/write */
> +	return (int)(iomode2 == IOMODE_READ) - (int)(iomode1 == IOMODE_READ);
> +}
> +
>  static void
>  pnfs_insert_layout(struct pnfs_layout_hdr *lo,
>  		   struct pnfs_layout_segment *lseg)
>  {
> +	struct pnfs_layout_segment *lp;
> +	int found = 0;
> +
>  	dprintk("%s:Begin\n", __func__);
>  
>  	assert_spin_locked(&lo->inode->i_lock);
> @@ -313,13 +413,28 @@ pnfs_insert_layout(struct pnfs_layout_hdr *lo,
>  		list_add_tail(&lo->layouts, &clp->cl_layouts);
>  		spin_unlock(&clp->cl_lock);
>  	}
> -	/* STUB - add the constructed lseg if necessary */
> -	if (list_empty(&lo->segs)) {
> +	list_for_each_entry(lp, &lo->segs, fi_list) {
> +		if (cmp_layout(lp->range.iomode, lseg->range.iomode) > 0)
> +			continue;
> +		list_add_tail(&lseg->fi_list, &lp->fi_list);
> +		dprintk("%s: inserted lseg %p "
> +			"iomode %d offset %llu length %llu before "
> +			"lp %p iomode %d offset %llu length %llu\n",
> +			__func__, lseg, lseg->range.iomode,
> +			lseg->range.offset, lseg->range.length,
> +			lp, lp->range.iomode, lp->range.offset,
> +			lp->range.length);
> +		found = 1;
> +		break;
> +	}
> +	if (!found) {
>  		list_add_tail(&lseg->fi_list, &lo->segs);
> -		get_layout_hdr_locked(lo);
> -		dprintk("%s: inserted lseg %p iomode %d at tail\n",
> -			__func__, lseg, lseg->iomode);
> +		dprintk("%s: inserted lseg %p "
> +			"iomode %d offset %llu length %llu at tail\n",
> +			__func__, lseg, lseg->range.iomode,
> +			lseg->range.offset, lseg->range.length);
>  	}
> +	get_layout_hdr_locked(lo);
>  
>  	dprintk("%s:Return\n", __func__);
>  }
> @@ -335,6 +450,7 @@ alloc_init_layout_hdr(struct inode *ino)
>  	lo->refcount = 1;
>  	INIT_LIST_HEAD(&lo->layouts);
>  	INIT_LIST_HEAD(&lo->segs);
> +	seqlock_init(&lo->seqlock);
>  	lo->inode = ino;
>  	return lo;
>  }
> @@ -362,11 +478,46 @@ pnfs_find_alloc_layout(struct inode *ino)
>  	return nfsi->layout;
>  }
>  
> -/* STUB - LAYOUTGET never succeeds, so cache is empty */
> +/*
> + * iomode matching rules:
> + * iomode	lseg	match
> + * -----	-----	-----
> + * ANY		READ	true
> + * ANY		RW	true
> + * RW		READ	false
> + * RW		RW	true
> + * READ		READ	true
> + * READ		RW	true
> + */
> +static int
> +is_matching_lseg(struct pnfs_layout_segment *lseg, u32 iomode)
> +{
> +	return (iomode != IOMODE_RW || lseg->range.iomode == IOMODE_RW);
> +}
> +
> +/*
> + * lookup range in layout
> + */
>  static struct pnfs_layout_segment *
>  pnfs_has_layout(struct pnfs_layout_hdr *lo, u32 iomode)
>  {
> -	return NULL;
> +	struct pnfs_layout_segment *lseg, *ret = NULL;
> +
> +	dprintk("%s:Begin\n", __func__);
> +
> +	assert_spin_locked(&lo->inode->i_lock);
> +	list_for_each_entry(lseg, &lo->segs, fi_list) {
> +		if (is_matching_lseg(lseg, iomode)) {
> +			ret = lseg;
> +			break;
> +		}
> +		if (cmp_layout(iomode, lseg->range.iomode) > 0)
> +			break;
> +	}
> +
> +	dprintk("%s:Return lseg %p ref %d\n",
> +		__func__, ret, ret ? atomic_read(&ret->kref.refcount) : 0);
> +	return ret;
>  }
>  
>  /*
> @@ -403,7 +554,7 @@ pnfs_update_layout(struct inode *ino,
>  	if (test_bit(lo_fail_bit(iomode), &nfsi->layout->state))
>  		goto out_unlock;
>  
> -	get_layout_hdr_locked(lo);
> +	get_layout_hdr_locked(lo); /* Matched in nfs4_layoutget_release */
>  	spin_unlock(&ino->i_lock);
>  
>  	lseg = send_layoutget(lo, ctx, iomode);
> @@ -415,3 +566,184 @@ out_unlock:
>  	spin_unlock(&ino->i_lock);
>  	goto out;
>  }
> +
> +int
> +pnfs_layout_process(struct nfs4_layoutget *lgp)
> +{
> +	struct pnfs_layout_hdr *lo = NFS_I(lgp->args.inode)->layout;
> +	struct nfs4_layoutget_res *res = &lgp->res;
> +	struct pnfs_layout_segment *lseg;
> +	struct inode *ino = lo->inode;
> +	int status = 0;
> +
> +	/* Inject layout blob into I/O device driver */
> +	lseg = PNFS_LD_IO_OPS(lo)->alloc_lseg(lo, res);
                 ^^^^^^^^^^^^^^

> +	if (!lseg || IS_ERR(lseg)) {
> +		if (!lseg)
> +			status = -ENOMEM;
> +		else
> +			status = PTR_ERR(lseg);
> +		dprintk("%s: Could not allocate layout: error %d\n",
> +		       __func__, status);
> +		goto out;
> +	}
> +
> +	spin_lock(&ino->i_lock);
> +	init_lseg(lo, lseg);
> +	lseg->range = res->range;
> +	*lgp->lsegpp = lseg;
> +	pnfs_insert_layout(lo, lseg);
> +
> +	/* Done processing layoutget. Set the layout stateid */
> +	pnfs_set_layout_stateid(lo, &res->stateid);
> +	spin_unlock(&ino->i_lock);
> +out:
> +	return status;
> +}
> +
> +/*
> + * Device ID cache. Currently supports one layout type per struct nfs_client.
> + * Add layout type to the lookup key to expand to support multiple types.
> + */
> +int
> +nfs4_alloc_init_deviceid_cache(struct nfs_client *clp,
> +			 void (*free_callback)(struct nfs4_deviceid *))
> +{
> +	struct nfs4_deviceid_cache *c;
> +
> +	c = kzalloc(sizeof(struct nfs4_deviceid_cache), GFP_KERNEL);
> +	if (!c)
> +		return -ENOMEM;
> +	spin_lock(&clp->cl_lock);
> +	if (clp->cl_devid_cache != NULL) {
> +		atomic_inc(&clp->cl_devid_cache->dc_ref);
> +		dprintk("%s [kref [%d]]\n", __func__,
> +			atomic_read(&clp->cl_devid_cache->dc_ref));
> +		kfree(c);
> +	} else {
> +		/* kzalloc initializes hlists */
> +		spin_lock_init(&c->dc_lock);
> +		atomic_set(&c->dc_ref, 1);
> +		c->dc_free_callback = free_callback;
> +		clp->cl_devid_cache = c;
> +		dprintk("%s [new]\n", __func__);
> +	}
> +	spin_unlock(&clp->cl_lock);
> +	return 0;
> +}
> +EXPORT_SYMBOL(nfs4_alloc_init_deviceid_cache);
> +
> +void
> +nfs4_init_deviceid_node(struct nfs4_deviceid *d)
> +{
> +	INIT_HLIST_NODE(&d->de_node);
> +	atomic_set(&d->de_ref, 1);
> +}
> +EXPORT_SYMBOL(nfs4_init_deviceid_node);
> +
> +/* Called from layoutdriver_io_operations->alloc_lseg */
> +void
> +nfs4_set_layout_deviceid(struct pnfs_layout_segment *l, struct nfs4_deviceid *d)
> +{
> +	dprintk("%s [%d]\n", __func__, atomic_read(&d->de_ref));
> +	l->deviceid = d;
> +}
> +EXPORT_SYMBOL(nfs4_set_layout_deviceid);
> +
> +/*
> + * Called from layoutdriver_io_operations->free_lseg
> + * last layout segment reference frees deviceid
> + */
> +void
> +nfs4_put_layout_deviceid(struct pnfs_layout_segment *l)
> +{
> +	struct nfs4_deviceid_cache *c =
> +		NFS_SERVER(l->layout->inode)->nfs_client->cl_devid_cache;
> +	struct pnfs_deviceid *id = &l->deviceid->de_id;
> +	struct nfs4_deviceid *d;
> +	struct hlist_node *n;
> +	long h = nfs4_deviceid_hash(id);
> +
> +	dprintk("%s [%d]\n", __func__, atomic_read(&l->deviceid->de_ref));
> +	if (!atomic_dec_and_lock(&l->deviceid->de_ref, &c->dc_lock))
> +		return;
> +
> +	hlist_for_each_entry_rcu(d, n, &c->dc_deviceids[h], de_node)
> +		if (!memcmp(&d->de_id, id, sizeof(*id))) {
> +			hlist_del_rcu(&d->de_node);
> +			spin_unlock(&c->dc_lock);
> +			synchronize_rcu();
> +			c->dc_free_callback(l->deviceid);
> +			return;
> +		}
> +	spin_unlock(&c->dc_lock);
> +}
> +EXPORT_SYMBOL(nfs4_put_layout_deviceid);
> +
> +/* Find and reference a deviceid */
> +struct nfs4_deviceid *
> +nfs4_find_get_deviceid(struct nfs4_deviceid_cache *c, struct pnfs_deviceid *id)
> +{
> +	struct nfs4_deviceid *d;
> +	struct hlist_node *n;
> +	long hash = nfs4_deviceid_hash(id);
> +
> +	dprintk("--> %s hash %ld\n", __func__, hash);
> +	rcu_read_lock();
> +	hlist_for_each_entry_rcu(d, n, &c->dc_deviceids[hash], de_node) {
> +		if (!memcmp(&d->de_id, id, sizeof(*id))) {
> +			if (!atomic_inc_not_zero(&d->de_ref)) {
> +				goto fail;
> +			} else {
> +				rcu_read_unlock();
> +				return d;
> +			}
> +		}
> +	}
> +fail:
> +	rcu_read_unlock();
> +	return NULL;
> +}
> +EXPORT_SYMBOL(nfs4_find_get_deviceid);
> +
> +/*
> + * Add a deviceid to the cache.
> + * GETDEVICEINFOs for same deviceid can race. If deviceid is found, discard new
> + */
> +struct nfs4_deviceid *
> +nfs4_add_deviceid(struct nfs4_deviceid_cache *c, struct nfs4_deviceid *new)
> +{
> +	struct nfs4_deviceid *d;
> +	struct hlist_node *n;
> +	long hash = nfs4_deviceid_hash(&new->de_id);
> +
> +	dprintk("--> %s hash %ld\n", __func__, hash);
> +	spin_lock(&c->dc_lock);
> +	hlist_for_each_entry_rcu(d, n, &c->dc_deviceids[hash], de_node) {
> +		if (!memcmp(&d->de_id, &new->de_id, sizeof(new->de_id))) {
> +			spin_unlock(&c->dc_lock);
> +			dprintk("%s [discard]\n", __func__);
> +			c->dc_free_callback(new);
> +			return d;
> +		}
> +	}
> +	hlist_add_head_rcu(&new->de_node, &c->dc_deviceids[hash]);
> +	spin_unlock(&c->dc_lock);
> +	dprintk("%s [new]\n", __func__);
> +	return new;
> +}
> +EXPORT_SYMBOL(nfs4_add_deviceid);
> +
> +void
> +nfs4_put_deviceid_cache(struct nfs_client *clp)
> +{
> +	struct nfs4_deviceid_cache *local = clp->cl_devid_cache;
> +
> +	dprintk("--> %s cl_devid_cache %p\n", __func__, clp->cl_devid_cache);
> +	if (atomic_dec_and_lock(&local->dc_ref, &clp->cl_lock)) {
> +		clp->cl_devid_cache = NULL;
> +		spin_unlock(&clp->cl_lock);
> +		kfree(local);
> +	}
> +}
> +EXPORT_SYMBOL(nfs4_put_deviceid_cache);
> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> index dac6a72..d343f83 100644
> --- a/fs/nfs/pnfs.h
> +++ b/fs/nfs/pnfs.h
> @@ -12,11 +12,14 @@
>  
>  struct pnfs_layout_segment {
>  	struct list_head fi_list;
> -	u32 iomode;
> +	struct pnfs_layout_range range;
>  	struct kref kref;
>  	struct pnfs_layout_hdr *layout;
> +	struct nfs4_deviceid *deviceid;
>  };
>  
> +#define NFS4_PNFS_DEVICEID4_SIZE 16
> +
>  #ifdef CONFIG_NFS_V4_1
>  
>  #define LAYOUT_NFSV4_1_MODULE_PREFIX "nfs-layouttype4"
> @@ -38,17 +41,86 @@ struct pnfs_layout_hdr {
>  	int			refcount;
>  	struct list_head	layouts;   /* other client layouts */
>  	struct list_head	segs;      /* layout segments list */
> +	seqlock_t		seqlock;   /* Protects the stateid */
> +	nfs4_stateid		stateid;
>  	unsigned long		state;
>  	struct inode		*inode;
>  };
>  
>  /* Layout driver I/O operations. */
>  struct layoutdriver_io_operations {
> +	struct pnfs_layout_segment * (*alloc_lseg) (struct pnfs_layout_hdr *layoutid, struct nfs4_layoutget_res *lgr);
> +	void (*free_lseg) (struct pnfs_layout_segment *lseg);
> +
>  	/* Registration information for a new mounted file system */
>  	int (*initialize_mountpoint) (struct nfs_client *);
>  	int (*uninitialize_mountpoint) (struct nfs_client *);
>  };
>  
> +struct pnfs_deviceid {
> +	char data[NFS4_PNFS_DEVICEID4_SIZE];
> +};
> +
> +struct pnfs_device {
> +	struct pnfs_deviceid dev_id;
> +	unsigned int  layout_type;
> +	unsigned int  mincount;
> +	struct page **pages;
> +	void          *area;
> +	unsigned int  pgbase;
> +	unsigned int  pglen;
> +	unsigned int  dev_notify_types;
> +};
> +
> +/*
> + * Device ID RCU cache. A device ID is unique per client ID and layout type.
> + */
> +#define NFS4_DEVICE_ID_HASH_BITS	5
> +#define NFS4_DEVICE_ID_HASH_SIZE	(1 << NFS4_DEVICE_ID_HASH_BITS)
> +#define NFS4_DEVICE_ID_HASH_MASK	(NFS4_DEVICE_ID_HASH_SIZE - 1)
> +
> +static inline u32
> +nfs4_deviceid_hash(struct pnfs_deviceid *id)
> +{
> +	unsigned char *cptr = (unsigned char *)id->data;
> +	unsigned int nbytes = NFS4_PNFS_DEVICEID4_SIZE;
> +	u32 x = 0;
> +
> +	while (nbytes--) {
> +		x *= 37;
> +		x += *cptr++;
> +	}
> +	return x & NFS4_DEVICE_ID_HASH_MASK;
> +}
> +
> +struct nfs4_deviceid_cache {
> +	spinlock_t		dc_lock;
> +	atomic_t		dc_ref;
> +	void			(*dc_free_callback)(struct nfs4_deviceid *);
> +	struct hlist_head	dc_deviceids[NFS4_DEVICE_ID_HASH_SIZE];
> +	struct hlist_head	dc_to_free;
> +};
> +
> +/* Device ID cache node */
> +struct nfs4_deviceid {
> +	struct hlist_node	de_node;
> +	struct pnfs_deviceid	de_id;
> +	atomic_t		de_ref;
> +};
> +
> +extern int nfs4_alloc_init_deviceid_cache(struct nfs_client *,
> +				void (*free_callback)(struct nfs4_deviceid *));
> +extern void nfs4_put_deviceid_cache(struct nfs_client *);
> +extern void nfs4_init_deviceid_node(struct nfs4_deviceid *);
> +extern struct nfs4_deviceid *nfs4_find_get_deviceid(
> +				struct nfs4_deviceid_cache *,
> +				struct pnfs_deviceid *);
> +extern struct nfs4_deviceid *nfs4_add_deviceid(struct nfs4_deviceid_cache *,
> +				struct nfs4_deviceid *);
> +extern void nfs4_set_layout_deviceid(struct pnfs_layout_segment *,
> +				struct nfs4_deviceid *);
> +extern void nfs4_put_layout_deviceid(struct pnfs_layout_segment *);
> +
>  extern int pnfs_register_layoutdriver(struct pnfs_layoutdriver_type *);
>  extern void pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *);
>  
> @@ -58,13 +130,30 @@ PNFS_NFS_SERVER(struct pnfs_layout_hdr *lo)
>  	return NFS_SERVER(lo->inode);
>  }
>  
> +static inline struct layoutdriver_io_operations *
> +PNFS_LD_IO_OPS(struct pnfs_layout_hdr *lo)
> +{
> +	return PNFS_NFS_SERVER(lo)->pnfs_curr_ld->ld_io_ops;
> +}
> +
> +/* nfs4proc.c */
> +extern int nfs4_proc_getdeviceinfo(struct nfs_server *server,
> +				   struct pnfs_device *dev);
> +extern int nfs4_proc_layoutget(struct nfs4_layoutget *lgp);
> +
> +/* pnfs.c */
>  struct pnfs_layout_segment *
>  pnfs_update_layout(struct inode *ino, struct nfs_open_context *ctx,
>  		   enum pnfs_iomode access_type);
>  void set_pnfs_layoutdriver(struct nfs_server *, u32 id);
>  void unset_pnfs_layoutdriver(struct nfs_server *);
> +int pnfs_layout_process(struct nfs4_layoutget *lgp);
> +void pnfs_set_layout_stateid(struct pnfs_layout_hdr *lo,
> +			     const nfs4_stateid *stateid);
>  void pnfs_destroy_layout(struct nfs_inode *);
>  void pnfs_destroy_all_layouts(struct nfs_client *);
> +void put_layout_hdr(struct inode *inode);
> +void pnfs_get_layout_stateid(nfs4_stateid *dst, struct pnfs_layout_hdr *lo);
>  
> 
>  static inline int lo_fail_bit(u32 iomode)
> diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
> index 2dde7c8..dcdd11c 100644
> --- a/include/linux/nfs4.h
> +++ b/include/linux/nfs4.h
> @@ -545,6 +545,8 @@ enum {
>  	NFSPROC4_CLNT_SEQUENCE,
>  	NFSPROC4_CLNT_GET_LEASE_TIME,
>  	NFSPROC4_CLNT_RECLAIM_COMPLETE,
> +	NFSPROC4_CLNT_LAYOUTGET,
> +	NFSPROC4_CLNT_GETDEVICEINFO,
>  };
>  
>  /* nfs41 types */
> diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
> index e670a9c..7512886 100644
> --- a/include/linux/nfs_fs_sb.h
> +++ b/include/linux/nfs_fs_sb.h
> @@ -83,6 +83,7 @@ struct nfs_client {
>  	u32			cl_exchange_flags;
>  	struct nfs4_session	*cl_session; 	/* sharred session */
>  	struct list_head	cl_layouts;
> +	struct nfs4_deviceid_cache *cl_devid_cache; /* pNFS deviceid cache */
>  #endif /* CONFIG_NFS_V4_1 */
>  
>  #ifdef CONFIG_NFS_FSCACHE
> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
> index 8a2c228..c4c6a61 100644
> --- a/include/linux/nfs_xdr.h
> +++ b/include/linux/nfs_xdr.h
> @@ -186,6 +186,55 @@ struct nfs4_get_lease_time_res {
>  	struct nfs4_sequence_res	lr_seq_res;
>  };
>  
> +#define PNFS_LAYOUT_MAXSIZE 4096
> +
> +struct nfs4_layoutdriver_data {
> +	__u32 len;
> +	void *buf;
> +};
> +
> +struct pnfs_layout_range {
> +	u32 iomode;
> +	u64 offset;
> +	u64 length;
> +};
> +
> +struct nfs4_layoutget_args {
> +	__u32 type;
> +	struct pnfs_layout_range range;
> +	__u64 minlength;
> +	__u32 maxcount;
> +	struct inode *inode;
> +	struct nfs_open_context *ctx;
> +	struct nfs4_sequence_args seq_args;
> +};
> +
> +struct nfs4_layoutget_res {
> +	__u32 return_on_close;
> +	struct pnfs_layout_range range;
> +	__u32 type;
> +	nfs4_stateid stateid;
> +	struct nfs4_layoutdriver_data layout;
> +	struct nfs4_sequence_res seq_res;
> +};
> +
> +struct nfs4_layoutget {
> +	struct nfs4_layoutget_args args;
> +	struct nfs4_layoutget_res res;
> +	struct pnfs_layout_segment **lsegpp;
> +	int status;
> +};
> +
> +struct nfs4_getdeviceinfo_args {
> +	struct pnfs_device *pdev;
> +	struct nfs4_sequence_args seq_args;
> +};
> +
> +struct nfs4_getdeviceinfo_res {
> +	struct pnfs_device *pdev;
> +	struct nfs4_sequence_res seq_res;
> +};
> +
>  /*
>   * Arguments to the open call.
>   */




* Re: [PATCH 13/13] RFC: pnfs: filelayout: add driver's LAYOUTGET and GETDEVICEINFO infrastructure
  2010-09-02 18:00 ` [PATCH 13/13] RFC: pnfs: filelayout: add driver's " Fred Isaman
@ 2010-09-10 20:33   ` Trond Myklebust
  0 siblings, 0 replies; 55+ messages in thread
From: Trond Myklebust @ 2010-09-10 20:33 UTC (permalink / raw)
  To: Fred Isaman; +Cc: linux-nfs

On Thu, 2010-09-02 at 14:00 -0400, Fred Isaman wrote:
> From: The pNFS Team <linux-nfs@vger.kernel.org>
> 
> Implement the driver's io_ops->alloc_lseg and free_lseg functions,
> which integrate into the deviceid cache and call out to
> nfs4_proc_getdeviceinfo when necessary.
> 
> Signed-off-by: TBD - melding/reorganization of several patches
> ---
>  fs/nfs/Makefile            |    2 +-
>  fs/nfs/client.c            |    1 +
>  fs/nfs/nfs4filelayout.c    |  203 ++++++++++++++++++++-
>  fs/nfs/nfs4filelayout.h    |   74 +++++++
>  fs/nfs/nfs4filelayoutdev.c |  450 ++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 728 insertions(+), 2 deletions(-)
>  create mode 100644 fs/nfs/nfs4filelayout.h
>  create mode 100644 fs/nfs/nfs4filelayoutdev.c
> 
> diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
> index 08a8889..4776ff9 100644
> --- a/fs/nfs/Makefile
> +++ b/fs/nfs/Makefile
> @@ -20,4 +20,4 @@ nfs-$(CONFIG_SYSCTL) += sysctl.o
>  nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-index.o
>  
>  obj-$(CONFIG_PNFS_FILE_LAYOUT) += nfs_layout_nfsv41_files.o
> -nfs_layout_nfsv41_files-y := nfs4filelayout.o
> +nfs_layout_nfsv41_files-y := nfs4filelayout.o nfs4filelayoutdev.o
> diff --git a/fs/nfs/client.c b/fs/nfs/client.c
> index 6fc5c84..bac8ac2 100644
> --- a/fs/nfs/client.c
> +++ b/fs/nfs/client.c
> @@ -255,6 +255,7 @@ void nfs_put_client(struct nfs_client *clp)
>  		nfs_free_client(clp);
>  	}
>  }
> +EXPORT_SYMBOL(nfs_put_client);
>  
>  #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
>  /*
> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
> index c685196..0104d09 100644
> --- a/fs/nfs/nfs4filelayout.c
> +++ b/fs/nfs/nfs4filelayout.c
> @@ -30,7 +30,9 @@
>   */
>  
>  #include <linux/nfs_fs.h>
> -#include "pnfs.h"
> +
> +#include "internal.h"
> +#include "nfs4filelayout.h"
>  
>  #define NFSDBG_FACILITY         NFSDBG_PNFS_LD
>  
> @@ -41,18 +43,217 @@ MODULE_DESCRIPTION("The NFSv4 file layout driver");
>  int
>  filelayout_initialize_mountpoint(struct nfs_client *clp)
>  {
> +	int status = nfs4_alloc_init_deviceid_cache(clp,
> +						nfs4_fl_free_deviceid_callback);
> +	if (status) {
> +		printk(KERN_WARNING "%s: deviceid cache could not be "
> +			"initialized\n", __func__);
> +		return status;
> +	}
> +	dprintk("%s: deviceid cache has been initialized successfully\n",
> +		__func__);
>  	return 0;
>  }
>  
> +/* Uninitialize a mountpoint by destroying its device list */
>  int
>  filelayout_uninitialize_mountpoint(struct nfs_client *clp)
>  {
>  	dprintk("--> %s\n", __func__);
>  
> +	if (clp->cl_devid_cache)
> +		nfs4_put_deviceid_cache(clp);
> +	return 0;
> +}
> +
> +/*
> + * filelayout_check_layout()
> + *
> + * Make sure layout segment parameters are sane WRT the device.
> + * At this point no generic layer initialization of the lseg has occurred,
> + * and nothing has been added to the layout_hdr cache.
> + *
> + */
> +static int
> +filelayout_check_layout(struct pnfs_layout_hdr *lo,
> +			struct nfs4_filelayout_segment *fl,
> +			struct nfs4_layoutget_res *lgr)
> +{
> +	struct pnfs_layout_segment *lseg = &fl->generic_hdr;
> +	struct nfs4_file_layout_dsaddr *dsaddr;
> +	int status = -EINVAL;
> +	struct nfs_server *nfss = PNFS_NFS_SERVER(lo);
> +
> +	dprintk("--> %s\n", __func__);
> +
> +	if (fl->pattern_offset > lgr->range.offset) {
> +		dprintk("%s pattern_offset %lld too large\n",
> +				__func__, fl->pattern_offset);
> +		goto out;
> +	}
> +
> +	if (fl->stripe_unit % PAGE_SIZE) {
> +		dprintk("%s Stripe unit (%u) not page aligned\n",
> +			__func__, fl->stripe_unit);
> +		goto out;
> +	}
> +
> +	/* find and reference the deviceid */
> +	dsaddr = nfs4_fl_find_get_deviceid(nfss->nfs_client, &fl->dev_id);
> +	if (dsaddr == NULL) {
> +		dsaddr = get_device_info(lo->inode, &fl->dev_id);
> +		if (dsaddr == NULL)
> +			goto out;
> +	}
> +
> +	nfs4_set_layout_deviceid(lseg, &dsaddr->deviceid);
> +
> +	if (fl->first_stripe_index < 0 ||
> +	    fl->first_stripe_index >= dsaddr->stripe_count) {
> +		dprintk("%s Bad first_stripe_index %d\n",
> +				__func__, fl->first_stripe_index);
> +		goto out_put;
> +	}
> +
> +	if ((fl->stripe_type == STRIPE_SPARSE &&
> +	    fl->num_fh > 1 && fl->num_fh != dsaddr->ds_num) ||
> +	    (fl->stripe_type == STRIPE_DENSE &&
> +	    fl->num_fh != dsaddr->stripe_count)) {
> +		dprintk("%s num_fh %u not valid for given packing\n",
> +			__func__, fl->num_fh);
> +		goto out_put;
> +	}
> +
> +	if (fl->stripe_unit % nfss->rsize || fl->stripe_unit % nfss->wsize) {
> +		dprintk("%s Stripe unit (%u) not aligned with rsize %u "
> +			"wsize %u\n", __func__, fl->stripe_unit, nfss->rsize,
> +			nfss->wsize);
> +	}
> +
> +	status = 0;
> +out:
> +	dprintk("<-- %s returns %d\n", __func__, status);
> +	return status;
> +out_put:
> +	nfs4_put_layout_deviceid(lseg);
> +	goto out;
> +}
> +
> +static void _filelayout_free_lseg(struct nfs4_filelayout_segment *fl);
> +static void filelayout_free_fh_array(struct nfs4_filelayout_segment *fl);
> +
> +static int
> +filelayout_decode_layout(struct pnfs_layout_hdr *flo,
> +		      struct nfs4_filelayout_segment *fl,
> +		      struct nfs4_layoutget_res *lgr)
> +{
> +	uint32_t *p = (uint32_t *)lgr->layout.buf;
> +	uint32_t nfl_util;
> +	int i;
> +
> +	dprintk("%s: set_layout_map Begin\n", __func__);
> +
> +	memcpy(&fl->dev_id, p, sizeof(fl->dev_id));
> +	p += XDR_QUADLEN(NFS4_PNFS_DEVICEID4_SIZE);
> +	print_deviceid(&fl->dev_id);
> +
> +	nfl_util = be32_to_cpup(p++);
> +	if (nfl_util & NFL4_UFLG_COMMIT_THRU_MDS)
> +		fl->commit_through_mds = 1;
> +	if (nfl_util & NFL4_UFLG_DENSE)
> +		fl->stripe_type = STRIPE_DENSE;
> +	else
> +		fl->stripe_type = STRIPE_SPARSE;
> +	fl->stripe_unit = nfl_util & ~NFL4_UFLG_MASK;
> +
> +	fl->first_stripe_index = be32_to_cpup(p++);
> +	p = xdr_decode_hyper(p, &fl->pattern_offset);
> +	fl->num_fh = be32_to_cpup(p++);
> +
> +	dprintk("%s: nfl_util 0x%X num_fh %u fsi %u po %llu\n",
> +		__func__, nfl_util, fl->num_fh, fl->first_stripe_index,
> +		fl->pattern_offset);
> +
> +	if (fl->num_fh * sizeof(struct nfs_fh) > 2*PAGE_SIZE) {
> +		fl->fh_array = vmalloc(fl->num_fh * sizeof(struct nfs_fh));
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Please do this differently. vmalloc() use is frowned upon unless you
really need _contiguous_ memory. The 32-bit vmalloc address space is
limited, and easily exhausted.

In this case you could instead allocate an array of pointers to struct
nfs_fh.
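
A rough user-space sketch of that layout, assuming an array of per-filehandle pointers. `fh_sketch`, `fh_array_alloc` and `fh_array_free` are hypothetical names, and calloc()/free() stand in for kcalloc()/kzalloc()/kfree(). Each filehandle is a small separate allocation, so the only array that grows with num_fh holds pointers rather than whole structs.

```c
#include <stdlib.h>

#define FH_MAXSIZE 128	/* assumed to match NFS_MAXFHSIZE */

/* Simplified stand-in for struct nfs_fh. */
struct fh_sketch {
	unsigned short size;
	unsigned char data[FH_MAXSIZE];
};

/* Free the first num_fh entries and then the pointer array itself. */
static void fh_array_free(struct fh_sketch **fhs, unsigned int num_fh)
{
	unsigned int i;

	if (!fhs)
		return;
	for (i = 0; i < num_fh; i++)
		free(fhs[i]);
	free(fhs);
}

/*
 * Allocate num_fh zeroed filehandles, one small allocation each, plus
 * one array of pointers to them.  Returns NULL on allocation failure
 * after releasing whatever was already allocated.
 */
static struct fh_sketch **fh_array_alloc(unsigned int num_fh)
{
	struct fh_sketch **fhs;
	unsigned int i;

	fhs = calloc(num_fh, sizeof(*fhs));
	if (!fhs)
		return NULL;
	for (i = 0; i < num_fh; i++) {
		fhs[i] = calloc(1, sizeof(**fhs));
		if (!fhs[i]) {
			fh_array_free(fhs, i);	/* only i entries exist */
			return NULL;
		}
	}
	return fhs;
}
```

With this shape the decode loop would fill fhs[i]->size and fhs[i]->data instead of fh_array[i].size and fh_array[i].data, and the free path no longer needs to choose between vfree() and kfree().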

> +		if (fl->fh_array)
> +			memset(fl->fh_array, 0,
> +				fl->num_fh * sizeof(struct nfs_fh));
> +	} else {
> +		fl->fh_array = kzalloc(fl->num_fh * sizeof(struct nfs_fh),
> +					GFP_KERNEL);
> +	}
> +	if (!fl->fh_array)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < fl->num_fh; i++) {
> +		/* fh */
> +		fl->fh_array[i].size = be32_to_cpup(p++);
> +		if (sizeof(struct nfs_fh) < fl->fh_array[i].size) {
> +			printk(KERN_ERR "Too big fh %d received %d\n",
> +				i, fl->fh_array[i].size);
> +			/* Layout is now invalid, pretend it doesn't exist */
> +			filelayout_free_fh_array(fl);
> +			fl->num_fh = 0;
> +			break;
> +		}
> +		memcpy(fl->fh_array[i].data, p, fl->fh_array[i].size);
> +		p += XDR_QUADLEN(fl->fh_array[i].size);
> +		dprintk("DEBUG: %s: fh len %d\n", __func__,
> +					fl->fh_array[i].size);
> +	}
> +
>  	return 0;
>  }
>  
> +static struct pnfs_layout_segment *
> +filelayout_alloc_lseg(struct pnfs_layout_hdr *layoutid,
> +		      struct nfs4_layoutget_res *lgr)
> +{
> +	struct nfs4_filelayout_segment *fl;
> +	int rc;
> +
> +	dprintk("--> %s\n", __func__);
> +	fl = kzalloc(sizeof(*fl), GFP_KERNEL);
> +	if (!fl)
> +		return NULL;
> +
> +	rc = filelayout_decode_layout(layoutid, fl, lgr);
> +	if (rc != 0 || filelayout_check_layout(layoutid, fl, lgr)) {
> +		_filelayout_free_lseg(fl);
> +		return NULL;
> +	}
> +	return &fl->generic_hdr;
> +}
> +
> +static void filelayout_free_fh_array(struct nfs4_filelayout_segment *fl)
> +{
> +	if (fl->num_fh * sizeof(struct nfs_fh) > 2*PAGE_SIZE)
> +		vfree(fl->fh_array);

See above.

> +	else
> +		kfree(fl->fh_array);
> +
> +	fl->fh_array = NULL;
> +}
> +
> +static void
> +_filelayout_free_lseg(struct nfs4_filelayout_segment *fl)
> +{
> +	filelayout_free_fh_array(fl);
> +	kfree(fl);
> +}
> +
> +static void
> +filelayout_free_lseg(struct pnfs_layout_segment *lseg)
> +{
> +	dprintk("--> %s\n", __func__);
> +	nfs4_put_layout_deviceid(lseg);
> +	_filelayout_free_lseg(FILELAYOUT_LSEG(lseg));
> +}
> +
>  struct layoutdriver_io_operations filelayout_io_operations = {
> +	.alloc_lseg              = filelayout_alloc_lseg,
> +	.free_lseg               = filelayout_free_lseg,
>  	.initialize_mountpoint   = filelayout_initialize_mountpoint,
>  	.uninitialize_mountpoint = filelayout_uninitialize_mountpoint,
>  };
> diff --git a/fs/nfs/nfs4filelayout.h b/fs/nfs/nfs4filelayout.h
> new file mode 100644
> index 0000000..2467b5f
> --- /dev/null
> +++ b/fs/nfs/nfs4filelayout.h
> @@ -0,0 +1,74 @@
> +/*
> + *  NFSv4 file layout driver data structures.
> + *
> + *  Copyright (c) 2002 The Regents of the University of Michigan.
> + *  All rights reserved.
> + *
> + *  Dean Hildebrand   <dhildebz@umich.edu>
> + */
> +
> +#ifndef FS_NFS_NFS4FILELAYOUT_H
> +#define FS_NFS_NFS4FILELAYOUT_H
> +
> +#include "pnfs.h"
> +
> +/*
> + * Field testing shows we need to support up to 4096 stripe indices.
> + * We store each index as a u8 (u32 on the wire) to keep the memory footprint
> + * reasonable. This in turn means we support a maximum of 256
> + * RFC 5661 multipath_list4 structures.
> + */
> +#define NFS4_PNFS_MAX_STRIPE_CNT 4096
> +#define NFS4_PNFS_MAX_MULTI_CNT  256 /* 256 fit into a u8 stripe_index */
> +
> +enum stripetype4 {
> +	STRIPE_SPARSE = 1,
> +	STRIPE_DENSE = 2
> +};
> +
> +/* Individual ip address */
> +struct nfs4_pnfs_ds {
> +	struct list_head	ds_node;  /* nfs4_pnfs_dev_hlist dev_dslist */
> +	u32			ds_ip_addr;
> +	u32			ds_port;
> +	struct nfs_client	*ds_clp;
> +	atomic_t		ds_count;
> +};
> +
> +struct nfs4_file_layout_dsaddr {
> +	struct nfs4_deviceid	deviceid;
> +	u32			stripe_count;
> +	u8			*stripe_indices;
> +	u32			ds_num;
> +	struct nfs4_pnfs_ds	*ds_list[1];
> +};
> +
> +struct nfs4_filelayout_segment {
> +	struct pnfs_layout_segment generic_hdr;
> +	u32 stripe_type;
> +	u32 commit_through_mds;
> +	u32 stripe_unit;
> +	u32 first_stripe_index;
> +	u64 pattern_offset;
> +	struct pnfs_deviceid dev_id;
> +	unsigned int num_fh;
> +	struct nfs_fh *fh_array;
> +};
> +
> +static inline struct nfs4_filelayout_segment *
> +FILELAYOUT_LSEG(struct pnfs_layout_segment *lseg)
> +{
> +	return container_of(lseg,
> +			    struct nfs4_filelayout_segment,
> +			    generic_hdr);
> +}
> +
> +extern void nfs4_fl_free_deviceid_callback(struct nfs4_deviceid *);
> +extern void print_ds(struct nfs4_pnfs_ds *ds);
> +extern void print_deviceid(struct pnfs_deviceid *dev_id);
> +extern struct nfs4_file_layout_dsaddr *
> +nfs4_fl_find_get_deviceid(struct nfs_client *, struct pnfs_deviceid *dev_id);
> +struct nfs4_file_layout_dsaddr *
> +get_device_info(struct inode *inode, struct pnfs_deviceid *dev_id);
> +
> +#endif /* FS_NFS_NFS4FILELAYOUT_H */
> diff --git a/fs/nfs/nfs4filelayoutdev.c b/fs/nfs/nfs4filelayoutdev.c
> new file mode 100644
> index 0000000..833ff9a
> --- /dev/null
> +++ b/fs/nfs/nfs4filelayoutdev.c
> @@ -0,0 +1,450 @@
> +/*
> + *  Device operations for the pnfs nfs4 file layout driver.
> + *
> + *  Copyright (c) 2002
> + *  The Regents of the University of Michigan
> + *  All Rights Reserved
> + *
> + *  Dean Hildebrand <dhildebz@umich.edu>
> + *  Garth Goodson   <Garth.Goodson@netapp.com>
> + *
> + *  Permission is granted to use, copy, create derivative works, and
> + *  redistribute this software and such derivative works for any purpose,
> + *  so long as the name of the University of Michigan is not used in
> + *  any advertising or publicity pertaining to the use or distribution
> + *  of this software without specific, written prior authorization. If
> + *  the above copyright notice or any other identification of the
> + *  University of Michigan is included in any copy of any portion of
> + *  this software, then the disclaimer below must also be included.
> + *
> + *  This software is provided as is, without representation or warranty
> + *  of any kind either express or implied, including without limitation
> + *  the implied warranties of merchantability, fitness for a particular
> + *  purpose, or noninfringement.  The Regents of the University of
> + *  Michigan shall not be liable for any damages, including special,
> + *  indirect, incidental, or consequential damages, with respect to any
> + *  claim arising out of or in connection with the use of the software,
> + *  even if it has been or is hereafter advised of the possibility of
> + *  such damages.
> + */
> +
> +#include <linux/nfs_fs.h>
> +
> +#include "internal.h"
> +#include "nfs4filelayout.h"
> +
> +#define NFSDBG_FACILITY		NFSDBG_PNFS_LD
> +
> +/*
> + * Data server cache
> + *
> + * Data servers can be mapped to different device ids.
> + * nfs4_pnfs_ds reference counting
> + *   - set to 1 on allocation
> + *   - incremented when a device id maps a data server already in the cache.
> + *   - decremented when deviceid is removed from the cache.
> + */
> +DEFINE_SPINLOCK(nfs4_ds_cache_lock);
> +static LIST_HEAD(nfs4_data_server_cache);
> +
> +/* Debug routines */
> +void
> +print_ds(struct nfs4_pnfs_ds *ds)
> +{
> +	if (ds == NULL) {
> +		dprintk("%s NULL device\n", __func__);
> +		return;
> +	}
> +	dprintk("        ip_addr %x port %hu\n"
> +		"        ref count %d\n"
> +		"        client %p\n"
> +		"        cl_exchange_flags %x\n",
> +		ntohl(ds->ds_ip_addr), ntohs(ds->ds_port),
> +		atomic_read(&ds->ds_count), ds->ds_clp,
> +		ds->ds_clp ? ds->ds_clp->cl_exchange_flags : 0);
> +}
> +
> +void
> +print_ds_list(struct nfs4_file_layout_dsaddr *dsaddr)
> +{
> +	int i;
> +
> +	dprintk("%s dsaddr->ds_num %d\n", __func__,
> +		dsaddr->ds_num);

Can we just do 1 test of ifdebug() at the beginning of this function
instead of doing the same test for each and every printk()?
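
To make the suggestion concrete, here is a minimal userspace sketch (not kernel code) of testing the debug mask once at the top of the list printer. The names debug_mask, DBG_PNFS_LD, debug_on(), struct ds, print_one() and print_list() are hypothetical stand-ins for the kernel's debug mask, ifdebug() and the print_ds*() helpers in the patch.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the kernel debug mask and its ifdebug() test. */
static unsigned int debug_mask;
#define DBG_PNFS_LD 0x2000u
#define debug_on(fac) (debug_mask & (fac))

struct ds {
	unsigned int ip;
	unsigned short port;
};

/* Unconditional printer: the caller is expected to have checked the mask. */
static void print_one(const struct ds *ds)
{
	printf("        ip_addr %x port %hu\n", ds->ip, ds->port);
}

/* Tests the mask once, then prints every entry; returns entries printed. */
static int print_list(const struct ds *list, int n)
{
	int i;

	if (!debug_on(DBG_PNFS_LD))
		return 0;	/* single test up front, nothing printed */

	printf("%s ds_num %d\n", __func__, n);
	for (i = 0; i < n; i++)
		print_one(&list[i]);
	return n;
}
```

With the mask clear the loop is never entered, so the per-entry cost disappears instead of being paid as one failed test per printk.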

> +	for (i = 0; i < dsaddr->ds_num; i++)
> +		print_ds(dsaddr->ds_list[i]);
> +}
> +
> +void print_deviceid(struct pnfs_deviceid *id)
> +{
> +	u32 *p = (u32 *)id;
> +
> +	dprintk("%s: device id= [%x%x%x%x]\n", __func__,
> +		p[0], p[1], p[2], p[3]);
> +}
> +
> +/* nfs4_ds_cache_lock is held */
> +static struct nfs4_pnfs_ds *
> +_data_server_lookup_locked(u32 ip_addr, u32 port)
> +{
> +	struct nfs4_pnfs_ds *ds;
> +
> +	dprintk("_data_server_lookup: ip_addr=%x port=%hu\n",
> +			ntohl(ip_addr), ntohs(port));
> +
> +	list_for_each_entry(ds, &nfs4_data_server_cache, ds_node) {
> +		if (ds->ds_ip_addr == ip_addr &&
> +		    ds->ds_port == port) {
> +			return ds;
> +		}
> +	}
> +	return NULL;
> +}
> +
> +static void
> +destroy_ds(struct nfs4_pnfs_ds *ds)
> +{
> +	dprintk("--> %s\n", __func__);
> +	print_ds(ds);
> +
> +	if (ds->ds_clp)
> +		nfs_put_client(ds->ds_clp);
> +	kfree(ds);
> +}
> +
> +static void
> +nfs4_fl_free_deviceid(struct nfs4_file_layout_dsaddr *dsaddr)
> +{
> +	struct nfs4_pnfs_ds *ds;
> +	int i;
> +
> +	print_deviceid(&dsaddr->deviceid.de_id);
> +
> +	for (i = 0; i < dsaddr->ds_num; i++) {
> +		ds = dsaddr->ds_list[i];
> +		if (ds != NULL) {
> +			if (atomic_dec_and_lock(&ds->ds_count,
> +						&nfs4_ds_cache_lock)) {
> +				list_del_init(&ds->ds_node);
> +				spin_unlock(&nfs4_ds_cache_lock);
> +				destroy_ds(ds);
> +			}
> +		}
> +	}
> +	kfree(dsaddr->stripe_indices);
> +	kfree(dsaddr);
> +}
> +
> +void
> +nfs4_fl_free_deviceid_callback(struct nfs4_deviceid *device)
> +{
> +	struct nfs4_file_layout_dsaddr *dsaddr =
> +		container_of(device, struct nfs4_file_layout_dsaddr, deviceid);
> +
> +	nfs4_fl_free_deviceid(dsaddr);
> +}
> +
> +static struct nfs4_pnfs_ds *
> +nfs4_pnfs_ds_add(struct inode *inode, u32 ip_addr, u32 port)
> +{
> +	struct nfs4_pnfs_ds *tmp_ds, *ds;
> +
> +	ds = kzalloc(sizeof(*tmp_ds), GFP_KERNEL);
> +	if (!ds)
> +		goto out;
> +
> +	spin_lock(&nfs4_ds_cache_lock);
> +	tmp_ds = _data_server_lookup_locked(ip_addr, port);
> +	if (tmp_ds == NULL) {
> +		ds->ds_ip_addr = ip_addr;
> +		ds->ds_port = port;
> +		atomic_set(&ds->ds_count, 1);
> +		INIT_LIST_HEAD(&ds->ds_node);
> +		ds->ds_clp = NULL;
> +		list_add(&ds->ds_node, &nfs4_data_server_cache);
> +		dprintk("%s add new data server ip 0x%x\n", __func__,
> +			ds->ds_ip_addr);
> +	} else {
> +		kfree(ds);
> +		atomic_inc(&tmp_ds->ds_count);
> +		dprintk("%s data server found ip 0x%x, inc'ed ds_count to %d\n",
> +			__func__, tmp_ds->ds_ip_addr,
> +			atomic_read(&tmp_ds->ds_count));
> +		ds = tmp_ds;
> +	}
> +	spin_unlock(&nfs4_ds_cache_lock);
> +out:
> +	return ds;
> +}
> +
> +/*
> + * Currently only support ipv4, and one multi-path address.
> + */
> +static struct nfs4_pnfs_ds *
> +decode_and_add_ds(__be32 **pp, struct inode *inode)
> +{
> +	struct nfs4_pnfs_ds *ds = NULL;
> +	char *buf;
> +	const char *ipend, *pstr;
> +	u32 ip_addr, port;
> +	int nlen, rlen, i;
> +	int tmp[2];
> +	__be32 *r_netid, *r_addr, *p = *pp;
> +
> +	/* r_netid */
> +	nlen = be32_to_cpup(p++);
> +	r_netid = p;
> +	p += XDR_QUADLEN(nlen);
> +
> +	/* r_addr */
> +	rlen = be32_to_cpup(p++);
> +	r_addr = p;
> +	p += XDR_QUADLEN(rlen);
> +	*pp = p;
> +
> +	/* Check that netid is "tcp" */
> +	if (nlen != 3 ||  memcmp((char *)r_netid, "tcp", 3)) {
> +		dprintk("%s: ERROR: non ipv4 TCP r_netid\n", __func__);
> +		goto out_err;
> +	}
> +
> +	/* ipv6 length plus port is legal */
> +	if (rlen > INET6_ADDRSTRLEN + 8) {
> +		dprintk("%s Invalid address, length %d\n", __func__,
> +			rlen);
> +		goto out_err;
> +	}
> +	buf = kmalloc(rlen + 1, GFP_KERNEL);
> +	buf[rlen] = '\0';
> +	memcpy(buf, r_addr, rlen);
> +
> +	/* replace the port dots with dashes for the in4_pton() delimiter*/
> +	for (i = 0; i < 2; i++) {
> +		char *res = strrchr(buf, '.');
> +		*res = '-';
> +	}
> +
> +	/* Currently only support ipv4 address */
> +	if (in4_pton(buf, rlen, (u8 *)&ip_addr, '-', &ipend) == 0) {
> +		dprintk("%s: Only ipv4 addresses supported\n", __func__);
> +		goto out_free;
> +	}
> +
> +	/* port */
> +	pstr = ipend;
> +	sscanf(pstr, "-%d-%d", &tmp[0], &tmp[1]);
> +	port = htons((tmp[0] << 8) | (tmp[1]));
> +
> +	ds = nfs4_pnfs_ds_add(inode, ip_addr, port);
> +	dprintk("%s Decoded address and port %s\n", __func__, buf);
> +out_free:
> +	kfree(buf);
> +out_err:
> +	return ds;
> +}
> +
> +
> +
> +/*Decode opaque device data and return the result */
> +static struct nfs4_file_layout_dsaddr*
> +decode_device(struct inode *ino, struct pnfs_device *pdev)
> +{
> +	int i, dummy;
> +	u32 cnt, num;
> +	u8 *indexp;
> +	__be32 *p = (__be32 *)pdev->area, *indicesp;
> +	struct nfs4_file_layout_dsaddr *dsaddr;
> +
> +	/* Get the stripe count (number of stripe index) */
> +	cnt = be32_to_cpup(p++);
> +	dprintk("%s stripe count  %d\n", __func__, cnt);
> +	if (cnt > NFS4_PNFS_MAX_STRIPE_CNT) {
> +		printk(KERN_WARNING "%s: stripe count %d greater than "
> +		       "supported maximum %d\n", __func__,
> +			cnt, NFS4_PNFS_MAX_STRIPE_CNT);
> +		goto out_err;
> +	}
> +
> +	/* Check the multipath list count */
> +	indicesp = p;
> +	p += XDR_QUADLEN(cnt << 2);
> +	num = be32_to_cpup(p++);
> +	dprintk("%s ds_num %u\n", __func__, num);
> +	if (num > NFS4_PNFS_MAX_MULTI_CNT) {
> +		printk(KERN_WARNING "%s: multipath count %d greater than "
> +			"supported maximum %d\n", __func__,
> +			num, NFS4_PNFS_MAX_MULTI_CNT);
> +		goto out_err;
> +	}
> +	dsaddr = kzalloc(sizeof(*dsaddr) +
> +			(sizeof(struct nfs4_pnfs_ds *) * (num - 1)),
> +			GFP_KERNEL);
> +	if (!dsaddr)
> +		goto out_err;
> +
> +	dsaddr->stripe_indices = kzalloc(sizeof(u8) * cnt, GFP_KERNEL);
> +	if (!dsaddr->stripe_indices)
> +		goto out_err_free;
> +
> +	dsaddr->stripe_count = cnt;
> +	dsaddr->ds_num = num;
> +
> +	memcpy(&dsaddr->deviceid.de_id, &pdev->dev_id, sizeof(pdev->dev_id));
> +
> +	/* Go back an read stripe indices */
> +	p = indicesp;
> +	indexp = &dsaddr->stripe_indices[0];
> +	for (i = 0; i < dsaddr->stripe_count; i++) {
> +		*indexp = be32_to_cpup(p++);
> +		if (*indexp >= num)
> +			goto out_err_free;
> +		indexp++;
> +	}
> +	/* Skip already read multipath list count */
> +	p++;
> +
> +	for (i = 0; i < dsaddr->ds_num; i++) {
> +		int j;
> +
> +		dummy = be32_to_cpup(p++); /* multipath count */
> +		if (dummy > 1) {
> +			printk(KERN_WARNING
> +			       "%s: Multipath count %d not supported, "
> +			       "skipping all greater than 1\n", __func__,
> +				dummy);
> +		}
> +		for (j = 0; j < dummy; j++) {
> +			if (j == 0) {
> +				dsaddr->ds_list[i] = decode_and_add_ds(&p, ino);
> +				if (dsaddr->ds_list[i] == NULL)
> +					goto out_err_free;
> +			} else {
> +				u32 len;
> +				/* skip extra multipath */
> +				len = be32_to_cpup(p++);
> +				p += XDR_QUADLEN(len);
> +				len = be32_to_cpup(p++);
> +				p += XDR_QUADLEN(len);
> +				continue;
> +			}
> +		}
> +	}
> +	nfs4_init_deviceid_node(&dsaddr->deviceid);
> +
> +	return dsaddr;
> +
> +out_err_free:
> +	nfs4_fl_free_deviceid(dsaddr);
> +out_err:
> +	dprintk("%s ERROR: returning NULL\n", __func__);
> +	return NULL;
> +}
> +
> +/*
> + * Decode the opaque device specified in 'dev'
> + * and add it to the list of available devices.
> + * If the deviceid is already cached, nfs4_add_deviceid will return
> + * a pointer to the cached struct and throw away the new.
> + */
> +static struct nfs4_file_layout_dsaddr*
> +decode_and_add_device(struct inode *inode, struct pnfs_device *dev)
> +{
> +	struct nfs4_file_layout_dsaddr *dsaddr;
> +	struct nfs4_deviceid *d;
> +
> +	dsaddr = decode_device(inode, dev);
> +	if (!dsaddr) {
> +		printk(KERN_WARNING "%s: Could not decode or add device\n",
> +			__func__);
> +		return NULL;
> +	}
> +
> +	d = nfs4_add_deviceid(NFS_SERVER(inode)->nfs_client->cl_devid_cache,
> +			      &dsaddr->deviceid);
> +
> +	return container_of(d, struct nfs4_file_layout_dsaddr, deviceid);
> +}
> +
> +/*
> + * Retrieve the information for dev_id, add it to the list
> + * of available devices, and return it.
> + */
> +struct nfs4_file_layout_dsaddr *
> +get_device_info(struct inode *inode, struct pnfs_deviceid *dev_id)
> +{
> +	struct pnfs_device *pdev = NULL;
> +	u32 max_resp_sz;
> +	int max_pages;
> +	struct page **pages = NULL;
> +	struct nfs4_file_layout_dsaddr *dsaddr = NULL;
> +	int rc, i;
> +	struct nfs_server *server = NFS_SERVER(inode);
> +
> +	/*
> +	 * Use the session max response size as the basis for setting
> +	 * GETDEVICEINFO's maxcount
> +	 */
> +	max_resp_sz = server->nfs_client->cl_session->fc_attrs.max_resp_sz;
> +	max_pages = max_resp_sz >> PAGE_SHIFT;
> +	dprintk("%s inode %p max_resp_sz %u max_pages %d\n",
> +		__func__, inode, max_resp_sz, max_pages);
> +
> +	pdev = kzalloc(sizeof(struct pnfs_device), GFP_KERNEL);
> +	if (pdev == NULL)
> +		return NULL;
> +
> +	pages = kzalloc(max_pages * sizeof(struct page *), GFP_KERNEL);
> +	if (pages == NULL) {
> +		kfree(pdev);
> +		return NULL;
> +	}
> +	for (i = 0; i < max_pages; i++) {
> +		pages[i] = alloc_page(GFP_KERNEL);
> +		if (!pages[i])
> +			goto out_free;
> +	}
> +
> +	/* set pdev->area */
> +	pdev->area = vmap(pages, max_pages, VM_MAP, PAGE_KERNEL);
> +	if (!pdev->area)
> +		goto out_free;
> +
> +	memcpy(&pdev->dev_id, dev_id, sizeof(*dev_id));
> +	pdev->layout_type = LAYOUT_NFSV4_1_FILES;
> +	pdev->pages = pages;
> +	pdev->pgbase = 0;
> +	pdev->pglen = PAGE_SIZE * max_pages;
> +	pdev->mincount = 0;
> +	/* TODO: Update types when CB_NOTIFY_DEVICEID is available */
> +	pdev->dev_notify_types = 0;
> +
> +	rc = nfs4_proc_getdeviceinfo(server, pdev);
> +	dprintk("%s getdevice info returns %d\n", __func__, rc);
> +	if (rc)
> +		goto out_free;
> +
> +	/*
> +	 * Found new device, need to decode it and then add it to the
> +	 * list of known devices for this mountpoint.
> +	 */
> +	dsaddr = decode_and_add_device(inode, pdev);
> +out_free:
> +	if (pdev->area != NULL)
> +		vunmap(pdev->area);
> +	for (i = 0; i < max_pages; i++)
> +		__free_page(pages[i]);
> +	kfree(pages);
> +	kfree(pdev);
> +	dprintk("<-- %s dsaddr %p\n", __func__, dsaddr);
> +	return dsaddr;
> +}
> +
> +struct nfs4_file_layout_dsaddr *
> +nfs4_fl_find_get_deviceid(struct nfs_client *clp, struct pnfs_deviceid *id)
> +{
> +	struct nfs4_deviceid *d;
> +
> +	d = nfs4_find_get_deviceid(clp->cl_devid_cache, id);
> +	return (d == NULL) ? NULL :
> +		container_of(d, struct nfs4_file_layout_dsaddr, deviceid);
> +}




* Re: [PATCH 07/13] RFC: pnfs: full mount/umount infrastructure
       [not found]     ` <1284146604.10062.68.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2010-09-10 20:53       ` Fred Isaman
  0 siblings, 0 replies; 55+ messages in thread
From: Fred Isaman @ 2010-09-10 20:53 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs

On Fri, Sep 10, 2010 at 12:23 PM, Trond Myklebust
<Trond.Myklebust@netapp.com> wrote:
> On Thu, 2010-09-02 at 14:00 -0400, Fred Isaman wrote:
>> From: The pNFS Team <linux-nfs@vger.kernel.org>
>>
>> Allow a module implementing a layout type to register, and
>> have its mount/umount routines called for filesystems that
>> the server declares support it.
>>
>> Signed-off-by: TBD - melding/reorganization of several patches
>> ---
>>  Documentation/filesystems/nfs/00-INDEX |    2 +
>>  Documentation/filesystems/nfs/pnfs.txt |   48 +++++++++++++++++++
>>  fs/nfs/Kconfig                         |    2 +-
>>  fs/nfs/pnfs.c                          |   79 +++++++++++++++++++++++++++++++-
>>  fs/nfs/pnfs.h                          |   14 ++++++
>>  5 files changed, 142 insertions(+), 3 deletions(-)
>>  create mode 100644 Documentation/filesystems/nfs/pnfs.txt
>>
>> diff --git a/Documentation/filesystems/nfs/00-INDEX b/Documentation/filesystems/nfs/00-INDEX
>> index 2f68cd6..8d930b9 100644
>> --- a/Documentation/filesystems/nfs/00-INDEX
>> +++ b/Documentation/filesystems/nfs/00-INDEX
>> @@ -12,5 +12,7 @@ nfs-rdma.txt
>>       - how to install and setup the Linux NFS/RDMA client and server software
>>  nfsroot.txt
>>       - short guide on setting up a diskless box with NFS root filesystem.
>> +pnfs.txt
>> +     - short explanation of some of the internals of the pnfs code
>>  rpc-cache.txt
>>       - introduction to the caching mechanisms in the sunrpc layer.
>> diff --git a/Documentation/filesystems/nfs/pnfs.txt b/Documentation/filesystems/nfs/pnfs.txt
>> new file mode 100644
>> index 0000000..bc0b9cf
>> --- /dev/null
>> +++ b/Documentation/filesystems/nfs/pnfs.txt
>> @@ -0,0 +1,48 @@
>> +Reference counting in pnfs:
>> +==========================
>> +
>> +The are several inter-related caches.  We have layouts which can
>> +reference multiple devices, each of which can reference multiple data servers.
>> +Each data server can be referenced by multiple devices.  Each device
>> +can be referenced by multiple layouts.  To keep all of this straight,
>> +we need to reference count.
>> +
>> +
>> +struct pnfs_layout_hdr
>> +----------------------
>> +The on-the-wire command LAYOUTGET corresponds to struct
>> +pnfs_layout_segment, usually referred to by the variable name lseg.
>> +Each nfs_inode may hold a pointer to a cache of of these layout
>> +segments in nfsi->layout, of type struct pnfs_layout_hdr.
>> +
>> +We reference the header for the inode pointing to it, across each
>> +outstanding RPC call that references it (LAYOUTGET, LAYOUTRETURN,
>> +LAYOUTCOMMIT), and for each lseg held within.
>> +
>> +Each header is also (when non-empty) put on a list associated with
>> +struct nfs_client (cl_layouts).  Being put on this list does not bump
>> +the reference count, as the layout is kept around by the lseg that
>> +keeps it in the list.
>> +
>> +deviceid_cache
>> +--------------
>> +lsegs reference device ids, which are resolved per nfs_client and
>> +layout driver type.  The device ids are held in a RCU cache (struct
>> +nfs4_deviceid_cache).  The cache itself is referenced across each
>> +mount.  The entries (struct nfs4_deviceid) themselves are held across
>> +the lifetime of each lseg referencing them.
>> +
>> +RCU is used because the deviceid is basically a write once, read many
>> +data structure.  The hlist size of 32 buckets needs better
>> +justification, but seems reasonable given that we can have multiple
>> +deviceid's per filesystem, and multiple filesystems per nfs_client.
>> +
>> +The hash code is copied from the nfsd code base.  A discussion of
>> +hashing and variations of this algorithm can be found at:
>> +http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809
>> +
>> +data server cache
>> +-----------------
>> +file driver devices refer to data servers, which are kept in a module
>> +level cache.  Its reference is held over the lifetime of the deviceid
>> +pointing to it.
>> diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig
>> index 6c2aad4..5f1b936 100644
>> --- a/fs/nfs/Kconfig
>> +++ b/fs/nfs/Kconfig
>> @@ -78,7 +78,7 @@ config NFS_V4_1
>>       depends on NFS_V4 && EXPERIMENTAL
>>       help
>>         This option enables support for minor version 1 of the NFSv4 protocol
>> -       (draft-ietf-nfsv4-minorversion1) in the kernel's NFS client.
>> +       (RFC 5661) in the kernel's NFS client.
>>
>>         If unsure, say N.
>>
>> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>> index 2e5dba1..8d503fc 100644
>> --- a/fs/nfs/pnfs.c
>> +++ b/fs/nfs/pnfs.c
>> @@ -32,16 +32,48 @@
>>
>>  #define NFSDBG_FACILITY              NFSDBG_PNFS
>>
>> -/* STUB that returns the equivalent of "no module found" */
>> +/* Locking:
>> + *
>> + * pnfs_spinlock:
>> + *      protects pnfs_modules_tbl.
>> + */
>> +static DEFINE_SPINLOCK(pnfs_spinlock);
>> +
>> +/*
>> + * pnfs_modules_tbl holds all pnfs modules
>> + */
>> +static LIST_HEAD(pnfs_modules_tbl);
>> +
>> +/* Return the registered pnfs layout driver module matching given id */
>> +static struct pnfs_layoutdriver_type *
>> +find_pnfs_driver_locked(u32 id) {
>> +     struct  pnfs_layoutdriver_type *local;
>> +
>> +     dprintk("PNFS: %s: Searching for %u\n", __func__, id);
>> +     list_for_each_entry(local, &pnfs_modules_tbl, pnfs_tblid)
>> +             if (local->id == id)
>> +                     goto out;
>> +     local = NULL;
>> +out:
>> +     return local;
>> +}
>> +
>>  static struct pnfs_layoutdriver_type *
>>  find_pnfs_driver(u32 id) {
>> -     return NULL;
>> +     struct  pnfs_layoutdriver_type *local;
>> +
>> +     spin_lock(&pnfs_spinlock);
>> +     local = find_pnfs_driver_locked(id);
>
> Don't you want some kind of reference count on this? I'd assume that you
> probably need a module_get() with a corresponding module_put() when you
> are done using the layoutdriver.
>

OK
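
For reference, a minimal userspace sketch of the "pin on lookup, drop when done" pattern being asked for. Everything here is hypothetical: struct layout_driver, its users and unloading fields, and the helpers driver_find_get() and driver_put() are illustration only. In the kernel I would expect the pin to be taken with try_module_get() on an owner field (which the posted struct pnfs_layoutdriver_type does not have) while pnfs_spinlock is held, and released with module_put().

```c
#include <stddef.h>

/* Hypothetical driver record; 'users' stands in for a module refcount. */
struct layout_driver {
	unsigned int id;
	int users;
	int unloading;			/* set once unregistration has begun */
	struct layout_driver *next;
};

/* Registered drivers; real code would protect this list with a lock. */
static struct layout_driver *drivers;

/* Look up a driver and pin it; returns NULL if absent or going away. */
static struct layout_driver *driver_find_get(unsigned int id)
{
	struct layout_driver *d;

	for (d = drivers; d != NULL; d = d->next) {
		if (d->id != id)
			continue;
		if (d->unloading)
			return NULL;	/* analogous to a failed try-get */
		d->users++;
		return d;
	}
	return NULL;
}

/* Drop the pin taken by driver_find_get(). */
static void driver_put(struct layout_driver *d)
{
	if (d != NULL)
		d->users--;
}
```

The point of taking the pin inside the lookup is that the caller never holds a driver pointer that could be unregistered between the search and first use.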


>> +     spin_unlock(&pnfs_spinlock);
>> +     return local;
>>  }
>>
>>  /* Unitialize a mountpoint in a layout driver */
>>  void
>>  unset_pnfs_layoutdriver(struct nfs_server *nfss)
>>  {
>> +     if (nfss->pnfs_curr_ld)
>> +             nfss->pnfs_curr_ld->ld_io_ops->uninitialize_mountpoint(nfss->nfs_client);
>
> That 'uninitialize_mountpoint' name doesn't make any sense. The
> nfs_client parameter isn't associated to a particular mountpoint.
>
>>       nfss->pnfs_curr_ld = NULL;
>>  }
>>
>> @@ -68,6 +100,12 @@ set_pnfs_layoutdriver(struct nfs_server *server, u32 id)
>>                       goto out_no_driver;
>>               }
>>       }
>> +     if (ld_type->ld_io_ops->initialize_mountpoint(server->nfs_client)) {
>
> Ditto.
>

OK.

Fred

>> +             printk(KERN_ERR
>> +                    "%s: Error initializing mount point for layout driver %u.\n",
>> +                    __func__, id);
>> +             goto out_no_driver;
>> +     }
>>       server->pnfs_curr_ld = ld_type;
>>       dprintk("%s: pNFS module for %u set\n", __func__, id);
>>       return;
>> @@ -76,3 +114,40 @@ out_no_driver:
>>       dprintk("%s: Using NFSv4 I/O\n", __func__);
>>       server->pnfs_curr_ld = NULL;
>>  }
>> +
>> +int
>> +pnfs_register_layoutdriver(struct pnfs_layoutdriver_type *ld_type)
>> +{
>> +     struct layoutdriver_io_operations *io_ops = ld_type->ld_io_ops;
>> +     int status = -EINVAL;
>> +
>> +     if (!io_ops) {
>> +             printk(KERN_ERR "%s Layout driver must provide io_ops\n",
>> +                     __func__);
>> +             return status;
>> +     }
>> +
>> +     spin_lock(&pnfs_spinlock);
>> +     if (!find_pnfs_driver_locked(ld_type->id)) {
>> +             list_add(&ld_type->pnfs_tblid, &pnfs_modules_tbl);
>> +             status = 0;
>> +             dprintk("%s Registering id:%u name:%s\n", __func__, ld_type->id,
>> +                     ld_type->name);
>> +     } else
>> +             printk(KERN_ERR "%s Module with id %d already loaded!\n",
>> +                     __func__, ld_type->id);
>> +     spin_unlock(&pnfs_spinlock);
>> +
>> +     return status;
>> +}
>> +EXPORT_SYMBOL(pnfs_register_layoutdriver);
>> +
>> +void
>> +pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *ld_type)
>> +{
>> +     dprintk("%s Deregistering id:%u\n", __func__, ld_type->id);
>> +     spin_lock(&pnfs_spinlock);
>> +     list_del(&ld_type->pnfs_tblid);
>> +     spin_unlock(&pnfs_spinlock);
>> +}
>> +EXPORT_SYMBOL(pnfs_unregister_layoutdriver);
>> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
>> index 3281fbf..9049b9a 100644
>> --- a/fs/nfs/pnfs.h
>> +++ b/fs/nfs/pnfs.h
>> @@ -16,8 +16,22 @@
>>
>>  /* Per-layout driver specific registration structure */
>>  struct pnfs_layoutdriver_type {
>> +     struct list_head pnfs_tblid;
>> +     const u32 id;
>> +     const char *name;
>> +     struct layoutdriver_io_operations *ld_io_ops;
>>  };
>>
>> +/* Layout driver I/O operations. */
>> +struct layoutdriver_io_operations {
>> +     /* Registration information for a new mounted file system */
>> +     int (*initialize_mountpoint) (struct nfs_client *);
>> +     int (*uninitialize_mountpoint) (struct nfs_client *);
>> +};
>> +
>> +extern int pnfs_register_layoutdriver(struct pnfs_layoutdriver_type *);
>> +extern void pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *);
>> +
>>  void set_pnfs_layoutdriver(struct nfs_server *, u32 id);
>>  void unset_pnfs_layoutdriver(struct nfs_server *);
>>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


* Re: [PATCH 08/13] RFC: pnfs: filelayout: introduce minimal file layout driver
  2010-09-10 19:31   ` Trond Myklebust
@ 2010-09-10 21:11     ` Fred Isaman
  2010-09-10 22:37       ` Trond Myklebust
  2010-09-13 10:16       ` Benny Halevy
  2010-09-10 23:56     ` Christoph Hellwig
  1 sibling, 2 replies; 55+ messages in thread
From: Fred Isaman @ 2010-09-10 21:11 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs

On Fri, Sep 10, 2010 at 12:31 PM, Trond Myklebust
<Trond.Myklebust@netapp.com> wrote:
> On Thu, 2010-09-02 at 14:00 -0400, Fred Isaman wrote:
>> From: The pNFS Team <linux-nfs@vger.kernel.org>
>>
>> This driver just registers itself and supplies trivial mount/umount functions.
>>
>> Signed-off-by: TBD - melding/reorganization of several patches
>> ---
>>  fs/nfs/Kconfig          |    5 +++
>>  fs/nfs/Makefile         |    3 ++
>>  fs/nfs/nfs4filelayout.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/nfs_fs.h  |    1 +
>>  4 files changed, 98 insertions(+), 0 deletions(-)
>>  create mode 100644 fs/nfs/nfs4filelayout.c
>>
>> diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig
>> index 5f1b936..980f2dc 100644
>> --- a/fs/nfs/Kconfig
>> +++ b/fs/nfs/Kconfig
>> @@ -82,6 +82,11 @@ config NFS_V4_1
>>
>>         If unsure, say N.
>>
>> +config PNFS_FILE_LAYOUT
>> +     tristate
>> +     depends on NFS_FS && NFS_V4_1
>> +     default m
>
> Should be 'default y', otherwise it has an implicit dependency on
> CONFIG_MODULES.
>

The idea was that the driver would normally be built as a module, and
that loading or unloading it would control whether pnfs is supported.

Is there a way to do this that does not introduce the implicit dependency?


>> +
>>  config ROOT_NFS
>>       bool "Root file system on NFS"
>>       depends on NFS_FS=y && IP_PNP
>> diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
>> index bb9e773..08a8889 100644
>> --- a/fs/nfs/Makefile
>> +++ b/fs/nfs/Makefile
>> @@ -18,3 +18,6 @@ nfs-$(CONFIG_NFS_V4)        += nfs4proc.o nfs4xdr.o nfs4state.o nfs4renewd.o \
>>  nfs-$(CONFIG_NFS_V4_1)       += pnfs.o
>>  nfs-$(CONFIG_SYSCTL) += sysctl.o
>>  nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-index.o
>> +
>> +obj-$(CONFIG_PNFS_FILE_LAYOUT) += nfs_layout_nfsv41_files.o
>> +nfs_layout_nfsv41_files-y := nfs4filelayout.o
>> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
>> new file mode 100644
>> index 0000000..c685196
>> --- /dev/null
>> +++ b/fs/nfs/nfs4filelayout.c
>> @@ -0,0 +1,89 @@
>> +/*
>> + *  Module for the pnfs nfs4 file layout driver.
>> + *  Defines all I/O and Policy interface operations, plus code
>> + *  to register itself with the pNFS client.
>> + *
>> + *  Copyright (c) 2002
>> + *  The Regents of the University of Michigan
>> + *  All Rights Reserved
>> + *
>> + *  Dean Hildebrand <dhildebz@umich.edu>
>> + *
>> + *  Permission is granted to use, copy, create derivative works, and
>> + *  redistribute this software and such derivative works for any purpose,
>> + *  so long as the name of the University of Michigan is not used in
>> + *  any advertising or publicity pertaining to the use or distribution
>> + *  of this software without specific, written prior authorization. If
>> + *  the above copyright notice or any other identification of the
>> + *  University of Michigan is included in any copy of any portion of
>> + *  this software, then the disclaimer below must also be included.
>> + *
>> + *  This software is provided as is, without representation or warranty
>> + *  of any kind either express or implied, including without limitation
>> + *  the implied warranties of merchantability, fitness for a particular
>> + *  purpose, or noninfringement.  The Regents of the University of
>> + *  Michigan shall not be liable for any damages, including special,
>> + *  indirect, incidental, or consequential damages, with respect to any
>> + *  claim arising out of or in connection with the use of the software,
>> + *  even if it has been or is hereafter advised of the possibility of
>> + *  such damages.
>> + */
>> +
>> +#include <linux/nfs_fs.h>
>> +#include "pnfs.h"
>> +
>> +#define NFSDBG_FACILITY         NFSDBG_PNFS_LD
>> +
>> +MODULE_LICENSE("GPL");
>> +MODULE_AUTHOR("Dean Hildebrand <dhildebz@umich.edu>");
>> +MODULE_DESCRIPTION("The NFSv4 file layout driver");
>> +
>> +int
>> +filelayout_initialize_mountpoint(struct nfs_client *clp)
>> +{
>> +     return 0;
>> +}
>> +
>> +int
>> +filelayout_uninitialize_mountpoint(struct nfs_client *clp)
>> +{
>> +     dprintk("--> %s\n", __func__);
>> +
>> +     return 0;
>> +}
>> +
>> +struct layoutdriver_io_operations filelayout_io_operations = {
>
> Should definitely be declared as 'const' (and possibly 'static').
>

OK

>> +     .initialize_mountpoint   = filelayout_initialize_mountpoint,
>> +     .uninitialize_mountpoint = filelayout_uninitialize_mountpoint,
>> +};
>> +
>> +
>> +struct pnfs_layoutdriver_type filelayout_type = {
>
> Ditto.

This includes a list_head field which is set by the generic layer.

>
>> +     .id = LAYOUT_NFSV4_1_FILES,
>> +     .name = "LAYOUT_NFSV4_1_FILES",
>> +     .ld_io_ops = &filelayout_io_operations,
>
> Why do we need a separate 'struct layoutdriver_io_operations'? Any
> reason those can't just be embedded in struct pnfs_layoutdriver_type?

I believe this decision was primarily aesthetic.  However, keeping
the static io_ops separate from the variable list_head seems like a
good idea.

Perhaps we could have a driver structure that includes the io_ops and
the static portions of pnfs_layoutdriver_type, with the generic layer
allocating a wrapper structure that is basically:
struct {
    struct list_head list;
    struct pnfs_layoutdriver_type *driver_info;
}
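
A rough userspace sketch of how that split could look, so the driver's own structure can be const while the generic layer owns the list linkage. All names here (struct driver_info, struct driver_entry, registry, register_driver(), unregister_driver()) are made up for illustration, a plain singly linked list stands in for list_head, and locking is omitted.

```c
#include <stdlib.h>

/* Immutable description supplied by a layout driver (illustrative names). */
struct driver_info {
	unsigned int id;
	const char *name;
};

/* Wrapper owned by the generic layer; holds the mutable list linkage. */
struct driver_entry {
	struct driver_entry *next;
	const struct driver_info *info;
};

static struct driver_entry *registry;

/* Returns 0 on success, -1 if the id is already taken or allocation fails. */
static int register_driver(const struct driver_info *info)
{
	struct driver_entry *e;

	for (e = registry; e != NULL; e = e->next)
		if (e->info->id == info->id)
			return -1;

	e = malloc(sizeof(*e));
	if (e == NULL)
		return -1;
	e->info = info;
	e->next = registry;
	registry = e;
	return 0;
}

/* Unlinks and frees the wrapper; the driver's own struct is untouched. */
static void unregister_driver(const struct driver_info *info)
{
	struct driver_entry **pp, *e;

	for (pp = &registry; (e = *pp) != NULL; pp = &e->next) {
		if (e->info == info) {
			*pp = e->next;
			free(e);
			return;
		}
	}
}
```

The cost is an allocation at registration time, which can fail; the benefit is that the driver never exposes writable state to the generic layer.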


Fred


>
>> +};
>> +
>> +static int __init nfs4filelayout_init(void)
>> +{
>> +     printk(KERN_INFO "%s: NFSv4 File Layout Driver Registering...\n",
>> +            __func__);
>> +
>> +     /*
>> +      * Need to register file_operations struct with global list to indicate
>> +      * that NFS4 file layout is a possible pNFS I/O module
>> +      */
>> +     return pnfs_register_layoutdriver(&filelayout_type);
>> +}
>> +
>> +static void __exit nfs4filelayout_exit(void)
>> +{
>> +     printk(KERN_INFO "%s: NFSv4 File Layout Driver Unregistering...\n",
>> +            __func__);
>> +
>> +     /* Unregister NFS4 file layout driver with pNFS client*/
>> +     pnfs_unregister_layoutdriver(&filelayout_type);
>> +}
>> +
>> +module_init(nfs4filelayout_init);
>> +module_exit(nfs4filelayout_exit);
>> diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
>> index 042c2bd..a0f49a3 100644
>> --- a/include/linux/nfs_fs.h
>> +++ b/include/linux/nfs_fs.h
>> @@ -614,6 +614,7 @@ extern void * nfs_root_data(void);
>>  #define NFSDBG_MOUNT         0x0400
>>  #define NFSDBG_FSCACHE               0x0800
>>  #define NFSDBG_PNFS          0x1000
>> +#define NFSDBG_PNFS_LD               0x2000
>>  #define NFSDBG_ALL           0xFFFF
>>
>>  #ifdef __KERNEL__
>
>


* Re: [PATCH 09/13] RFC: nfs: create and destroy inode's layout cache
       [not found]     ` <1284147785.10062.80.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2010-09-10 21:13       ` Fred Isaman
  0 siblings, 0 replies; 55+ messages in thread
From: Fred Isaman @ 2010-09-10 21:13 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs

On Fri, Sep 10, 2010 at 12:43 PM, Trond Myklebust
<Trond.Myklebust@netapp.com> wrote:
> On Thu, 2010-09-02 at 14:00 -0400, Fred Isaman wrote:
>> From: The pNFS Team <linux-nfs@vger.kernel.org>
>>
>> At the start of the io paths, try to grab the relevant layout
>> information.  This will initiate the inode's layout cache, but
>> stubs ensure the cache stays empty.
>>
>> Signed-off-by: TBD - melding/reorganization of several patches
>> ---
>>  fs/nfs/file.c          |    5 ++
>>  fs/nfs/inode.c         |    3 +
>>  fs/nfs/pnfs.c          |  140 ++++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/nfs/pnfs.h          |   39 +++++++++++++
>>  fs/nfs/read.c          |    3 +
>>  include/linux/nfs_fs.h |    3 +
>>  6 files changed, 193 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
>> index eb51bd6..10ebdfb 100644
>> --- a/fs/nfs/file.c
>> +++ b/fs/nfs/file.c
>> @@ -36,6 +36,7 @@
>>  #include "internal.h"
>>  #include "iostat.h"
>>  #include "fscache.h"
>> +#include "pnfs.h"
>>
>>  #define NFSDBG_FACILITY              NFSDBG_FILE
>>
>> @@ -386,6 +387,10 @@ static int nfs_write_begin(struct file *file, struct address_space *mapping,
>>               file->f_path.dentry->d_name.name,
>>               mapping->host->i_ino, len, (long long) pos);
>>
>> +     pnfs_update_layout(mapping->host,
>> +                        nfs_file_open_context(file),
>> +                        IOMODE_RW);
>> +
>>  start:
>>       /*
>>        * Prevent starvation issues if someone is doing a consistency
>> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
>> index 7d2d6c7..0dc6dad 100644
>> --- a/fs/nfs/inode.c
>> +++ b/fs/nfs/inode.c
>> @@ -48,6 +48,7 @@
>>  #include "internal.h"
>>  #include "fscache.h"
>>  #include "dns_resolve.h"
>> +#include "pnfs.h"
>>
>>  #define NFSDBG_FACILITY              NFSDBG_VFS
>>
>> @@ -1409,6 +1410,7 @@ void nfs4_evict_inode(struct inode *inode)
>>  {
>>       truncate_inode_pages(&inode->i_data, 0);
>>       end_writeback(inode);
>> +     pnfs_destroy_layout(NFS_I(inode));
>>       /* If we are holding a delegation, return it! */
>>       nfs_inode_return_delegation_noreclaim(inode);
>>       /* First call standard NFS clear_inode() code */
>> @@ -1446,6 +1448,7 @@ static inline void nfs4_init_once(struct nfs_inode *nfsi)
>>       nfsi->delegation = NULL;
>>       nfsi->delegation_state = 0;
>>       init_rwsem(&nfsi->rwsem);
>> +     nfsi->layout = NULL;
>>  #endif
>>  }
>>
>> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>> index 8d503fc..65f923b 100644
>> --- a/fs/nfs/pnfs.c
>> +++ b/fs/nfs/pnfs.c
>> @@ -151,3 +151,143 @@ pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *ld_type)
>>       spin_unlock(&pnfs_spinlock);
>>  }
>>  EXPORT_SYMBOL(pnfs_unregister_layoutdriver);
>> +
>> +static void
>> +get_layout_hdr_locked(struct pnfs_layout_hdr *lo)
>> +{
>> +     assert_spin_locked(&lo->inode->i_lock);
>> +     lo->refcount++;
>> +}
>> +
>> +static void
>> +put_layout_hdr_locked(struct pnfs_layout_hdr *lo)
>> +{
>> +     assert_spin_locked(&lo->inode->i_lock);
>> +     BUG_ON(lo->refcount <= 0);
>> +
>> +     lo->refcount--;
>> +     if (!lo->refcount) {
>> +             dprintk("%s: freeing layout cache %p\n", __func__, lo);
>> +             NFS_I(lo->inode)->layout = NULL;
>> +             kfree(lo);
>> +     }
>> +}
>> +
>> +void
>> +pnfs_destroy_layout(struct nfs_inode *nfsi)
>> +{
>> +     struct pnfs_layout_hdr *lo;
>> +
>> +     spin_lock(&nfsi->vfs_inode.i_lock);
>> +     lo = nfsi->layout;
>> +     if (lo) {
>> +             /* Matched by refcount set to 1 in alloc_init_layout_hdr */
>> +             put_layout_hdr_locked(lo);
>> +     }
>> +     spin_unlock(&nfsi->vfs_inode.i_lock);
>> +}
>> +
>> +/* STUB - pretend LAYOUTGET to server failed */
>> +static struct pnfs_layout_segment *
>> +send_layoutget(struct pnfs_layout_hdr *lo,
>> +        struct nfs_open_context *ctx,
>> +        u32 iomode)
>> +{
>> +     struct inode *ino = lo->inode;
>> +
>> +     set_bit(lo_fail_bit(iomode), &lo->state);
>> +     spin_lock(&ino->i_lock);
>> +     put_layout_hdr_locked(lo);
>> +     spin_unlock(&ino->i_lock);
>> +     return NULL;
>> +}
>> +
>> +static struct pnfs_layout_hdr *
>> +alloc_init_layout_hdr(struct inode *ino)
>> +{
>> +     struct pnfs_layout_hdr *lo;
>> +
>> +     lo = kzalloc(sizeof(struct pnfs_layout_hdr), GFP_KERNEL);
>> +     if (!lo)
>> +             return NULL;
>> +     lo->refcount = 1;
>> +     lo->inode = ino;
>> +     return lo;
>> +}
>> +
>> +static struct pnfs_layout_hdr *
>> +pnfs_find_alloc_layout(struct inode *ino)
>> +{
>> +     struct nfs_inode *nfsi = NFS_I(ino);
>> +     struct pnfs_layout_hdr *new = NULL;
>> +
>> +     dprintk("%s Begin ino=%p layout=%p\n", __func__, ino, nfsi->layout);
>> +
>> +     assert_spin_locked(&ino->i_lock);
>> +     if (nfsi->layout)
>> +             return nfsi->layout;
>> +
>> +     spin_unlock(&ino->i_lock);
>> +     new = alloc_init_layout_hdr(ino);
>> +     spin_lock(&ino->i_lock);
>> +
>> +     if (likely(nfsi->layout == NULL))       /* Won the race? */
>> +             nfsi->layout = new;
>> +     else
>> +             kfree(new);
>> +     return nfsi->layout;
>> +}
>> +
>> +/* STUB - LAYOUTGET never succeeds, so cache is empty */
>> +static struct pnfs_layout_segment *
>> +pnfs_has_layout(struct pnfs_layout_hdr *lo, u32 iomode)
>> +{
>> +     return NULL;
>> +}
>> +
>> +/*
>> + * Layout segment is retrieved from the server if not cached.
>> + * The appropriate layout segment is referenced and returned to the caller.
>> + */
>> +struct pnfs_layout_segment *
>> +pnfs_update_layout(struct inode *ino,
>> +                struct nfs_open_context *ctx,
>> +                enum pnfs_iomode iomode)
>> +{
>> +     struct nfs_inode *nfsi = NFS_I(ino);
>> +     struct pnfs_layout_hdr *lo;
>> +     struct pnfs_layout_segment *lseg = NULL;
>> +
>> +     if (!pnfs_enabled_sb(NFS_SERVER(ino)))
>> +             return NULL;
>> +     spin_lock(&ino->i_lock);
>> +     lo = pnfs_find_alloc_layout(ino);
>> +     if (lo == NULL) {
>> +             dprintk("%s ERROR: can't get pnfs_layout_hdr\n", __func__);
>> +             goto out_unlock;
>> +     }
>> +
>> +     /* Check to see if the layout for the given range already exists */
>> +     lseg = pnfs_has_layout(lo, iomode);
>> +     if (lseg) {
>> +             dprintk("%s: Using cached lseg %p for iomode %d)\n",
>> +                     __func__, lseg, iomode);
>> +             goto out_unlock;
>> +     }
>> +
>> +     /* if LAYOUTGET already failed once we don't try again */
>> +     if (test_bit(lo_fail_bit(iomode), &nfsi->layout->state))
>> +             goto out_unlock;
>> +
>> +     get_layout_hdr_locked(lo);
>> +     spin_unlock(&ino->i_lock);
>> +
>> +     lseg = send_layoutget(lo, ctx, iomode);
>> +out:
>> +     dprintk("%s end, state 0x%lx lseg %p\n", __func__,
>> +             nfsi->layout->state, lseg);
>> +     return lseg;
>> +out_unlock:
>> +     spin_unlock(&ino->i_lock);
>> +     goto out;
>> +}
>> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
>> index 9049b9a..b63b445 100644
>> --- a/fs/nfs/pnfs.h
>> +++ b/fs/nfs/pnfs.h
>> @@ -14,6 +14,11 @@
>>
>>  #define LAYOUT_NFSV4_1_MODULE_PREFIX "nfs-layouttype4"
>>
>> +enum {
>> +     NFS_LAYOUT_RO_FAILED = 0,       /* get ro layout failed stop trying */
>> +     NFS_LAYOUT_RW_FAILED,           /* get rw layout failed stop trying */
>> +};
>> +
>>  /* Per-layout driver specific registration structure */
>>  struct pnfs_layoutdriver_type {
>>       struct list_head pnfs_tblid;
>> @@ -22,6 +27,12 @@ struct pnfs_layoutdriver_type {
>>       struct layoutdriver_io_operations *ld_io_ops;
>>  };
>>
>> +struct pnfs_layout_hdr {
>> +     int                     refcount;
>        ^^^^^ Why not make this 'unsigned int', and/or 'unsigned long'?

OK.

Fred
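
To illustrate the review point above: a minimal userspace sketch, not code from the patch set, of the same get/put pattern with an unsigned counter. The names layout_hdr, layout_hdr_alloc, layout_hdr_get and layout_hdr_put are hypothetical stand-ins, assert() stands in for BUG_ON(), and the i_lock locking that the patch relies on is left out. Once the counter is unsigned, the sanity check can no longer be "<= 0" and has to test for zero before the decrement.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace stand-in for struct pnfs_layout_hdr; only the fields used here. */
struct layout_hdr {
	unsigned int refcount;	/* unsigned, as suggested in the review */
	unsigned long state;
};

/* Allocate a header holding one initial reference (cf. alloc_init_layout_hdr). */
static struct layout_hdr *layout_hdr_alloc(void)
{
	struct layout_hdr *lo = calloc(1, sizeof(*lo));

	if (lo)
		lo->refcount = 1;
	return lo;
}

/* Take an additional reference. */
static void layout_hdr_get(struct layout_hdr *lo)
{
	lo->refcount++;
}

/*
 * Drop one reference. Returns 1 if this was the last reference and the
 * header was freed, 0 otherwise. An unsigned counter cannot go negative,
 * so an underflow has to be caught before the decrement.
 */
static int layout_hdr_put(struct layout_hdr *lo)
{
	assert(lo->refcount != 0);	/* stands in for BUG_ON() */
	if (--lo->refcount == 0) {
		free(lo);
		return 1;
	}
	return 0;
}
```

In the patch itself every one of these operations runs under inode->i_lock, which is why a plain integer is enough there.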

>> +     unsigned long           state;
>> +     struct inode            *inode;
>> +};
>> +
>>  /* Layout driver I/O operations. */
>>  struct layoutdriver_io_operations {
>>       /* Registration information for a new mounted file system */
>> @@ -32,11 +43,39 @@ struct layoutdriver_io_operations {
>>  extern int pnfs_register_layoutdriver(struct pnfs_layoutdriver_type *);
>>  extern void pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *);
>>
>> +struct pnfs_layout_segment *
>> +pnfs_update_layout(struct inode *ino, struct nfs_open_context *ctx,
>> +                enum pnfs_iomode access_type);
>>  void set_pnfs_layoutdriver(struct nfs_server *, u32 id);
>>  void unset_pnfs_layoutdriver(struct nfs_server *);
>> +void pnfs_destroy_layout(struct nfs_inode *);
>> +
>> +
>> +static inline int lo_fail_bit(u32 iomode)
>> +{
>> +     return iomode == IOMODE_RW ?
>> +                      NFS_LAYOUT_RW_FAILED : NFS_LAYOUT_RO_FAILED;
>> +}
>> +
>> +/* Return true if a layout driver is being used for this mountpoint */
>> +static inline int pnfs_enabled_sb(struct nfs_server *nfss)
>> +{
>> +     return nfss->pnfs_curr_ld != NULL;
>> +}
>>
>>  #else  /* CONFIG_NFS_V4_1 */
>>
>> +static inline void pnfs_destroy_layout(struct nfs_inode *nfsi)
>> +{
>> +}
>> +
>> +static inline struct pnfs_layout_segment *
>> +pnfs_update_layout(struct inode *ino, struct nfs_open_context *ctx,
>> +                enum pnfs_iomode access_type)
>> +{
>> +     return NULL;
>> +}
>> +
>>  static inline void set_pnfs_layoutdriver(struct nfs_server *s, u32 id)
>>  {
>>  }
>> diff --git a/fs/nfs/read.c b/fs/nfs/read.c
>> index 87adc27..f7eb66f 100644
>> --- a/fs/nfs/read.c
>> +++ b/fs/nfs/read.c
>> @@ -25,6 +25,7 @@
>>  #include "internal.h"
>>  #include "iostat.h"
>>  #include "fscache.h"
>> +#include "pnfs.h"
>>
>>  #define NFSDBG_FACILITY              NFSDBG_PAGECACHE
>>
>> @@ -121,6 +122,7 @@ int nfs_readpage_async(struct nfs_open_context *ctx, struct inode *inode,
>>       len = nfs_page_length(page);
>>       if (len == 0)
>>               return nfs_return_empty_page(page);
>> +     pnfs_update_layout(inode, ctx, IOMODE_READ);
>>       new = nfs_create_request(ctx, inode, page, 0, len);
>>       if (IS_ERR(new)) {
>>               unlock_page(page);
>> @@ -625,6 +627,7 @@ int nfs_readpages(struct file *filp, struct address_space *mapping,
>>       if (ret == 0)
>>               goto read_complete; /* all pages were read */
>>
>> +     pnfs_update_layout(inode, desc.ctx, IOMODE_READ);
>>       if (rsize < PAGE_CACHE_SIZE)
>>               nfs_pageio_init(&pgio, inode, nfs_pagein_multi, rsize, 0);
>>       else
>> diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
>> index a0f49a3..ebd87a9 100644
>> --- a/include/linux/nfs_fs.h
>> +++ b/include/linux/nfs_fs.h
>> @@ -188,6 +188,9 @@ struct nfs_inode {
>>       struct nfs_delegation   *delegation;
>>       fmode_t                  delegation_state;
>>       struct rw_semaphore     rwsem;
>> +
>> +     /* pNFS layout information */
>> +     struct pnfs_layout_hdr *layout;
>>  #endif /* CONFIG_NFS_V4*/
>>  #ifdef CONFIG_NFS_FSCACHE
>>       struct fscache_cookie   *fscache;
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 10/13] RFC: nfs: client needs to maintain list of inodes with active layouts
       [not found]     ` <1284148768.10062.94.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2010-09-10 21:18       ` Fred Isaman
  0 siblings, 0 replies; 55+ messages in thread
From: Fred Isaman @ 2010-09-10 21:18 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs

On Fri, Sep 10, 2010 at 12:59 PM, Trond Myklebust
<Trond.Myklebust@netapp.com> wrote:
> On Thu, 2010-09-02 at 14:00 -0400, Fred Isaman wrote:
>> From: The pNFS Team <linux-nfs@vger.kernel.org>
>
>> +static inline struct nfs_server *
>> +PNFS_NFS_SERVER(struct pnfs_layout_hdr *lo)
>> +{
>> +     return NFS_SERVER(lo->inode);
>> +}
>> +
>
> Why do we need this?

OK, it is gone.

Fred

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/13] RFC: pnfs: add LAYOUTGET and GETDEVICEINFO infrastructure
  2010-09-10 20:11   ` Trond Myklebust
@ 2010-09-10 21:47     ` Fred Isaman
  2010-09-10 22:43       ` Trond Myklebust
  2010-09-13 14:16       ` Benny Halevy
  0 siblings, 2 replies; 55+ messages in thread
From: Fred Isaman @ 2010-09-10 21:47 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs

On Fri, Sep 10, 2010 at 1:11 PM, Trond Myklebust
<Trond.Myklebust@netapp.com> wrote:
> On Thu, 2010-09-02 at 14:00 -0400, Fred Isaman wrote:
>> From: The pNFS Team <linux-nfs@vger.kernel.org>
>>
>> Add the ability to actually send LAYOUTGET and GETDEVICEINFO.  This also adds
>> in the machinery to handle layout state and the deviceid cache.  Note that
>> GETDEVICEINFO is not called directly by the generic layer.  Instead it
>> is called by the drivers while parsing the LAYOUTGET opaque data in response
>> to an unknown device id embedded therein.  Annoyingly, RFC 5661 only encodes
>> device ids within the driver-specific opaque data.
>>
>> Signed-off-by: TBD - melding/reorganization of several patches
>> ---
>>  fs/nfs/nfs4proc.c         |  134 ++++++++++++++++
>>  fs/nfs/nfs4xdr.c          |  302 +++++++++++++++++++++++++++++++++++
>>  fs/nfs/pnfs.c             |  382 ++++++++++++++++++++++++++++++++++++++++++---
>>  fs/nfs/pnfs.h             |   91 +++++++++++-
>>  include/linux/nfs4.h      |    2 +
>>  include/linux/nfs_fs_sb.h |    1 +
>>  include/linux/nfs_xdr.h   |   49 ++++++
>>  7 files changed, 935 insertions(+), 26 deletions(-)
>>
>> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
>> index c7c7277..7eeea0e 100644
>> --- a/fs/nfs/nfs4proc.c
>> +++ b/fs/nfs/nfs4proc.c
>> @@ -55,6 +55,7 @@
>>  #include "internal.h"
>>  #include "iostat.h"
>>  #include "callback.h"
>> +#include "pnfs.h"
>>
>>  #define NFSDBG_FACILITY              NFSDBG_PROC
>>
>> @@ -5335,6 +5336,139 @@ out:
>>       dprintk("<-- %s status=%d\n", __func__, status);
>>       return status;
>>  }
>> +
>> +static void
>> +nfs4_layoutget_prepare(struct rpc_task *task, void *calldata)
>> +{
>> +     struct nfs4_layoutget *lgp = calldata;
>> +     struct inode *ino = lgp->args.inode;
>> +     struct nfs_server *server = NFS_SERVER(ino);
>> +
>> +     dprintk("--> %s\n", __func__);
>> +     if (nfs4_setup_sequence(server, &lgp->args.seq_args,
>> +                             &lgp->res.seq_res, 0, task))
>> +             return;
>> +     rpc_call_start(task);
>> +}
>> +
>> +static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
>> +{
>> +     struct nfs4_layoutget *lgp = calldata;
>> +     struct inode *ino = lgp->args.inode;
>> +     struct nfs_server *server = NFS_SERVER(ino);
>> +
>> +     dprintk("--> %s\n", __func__);
>> +
>> +     if (!nfs4_sequence_done(task, &lgp->res.seq_res))
>> +             return;
>> +
>> +     if (RPC_ASSASSINATED(task))
>> +             return;
>> +
>> +     if (nfs4_async_handle_error(task, server, NULL) == -EAGAIN)
>> +             nfs_restart_rpc(task, server->nfs_client);
>> +
>> +     lgp->status = task->tk_status;
>> +     dprintk("<-- %s\n", __func__);
>> +}
>> +
>> +static void nfs4_layoutget_release(void *calldata)
>> +{
>> +     struct nfs4_layoutget *lgp = calldata;
>> +
>> +     dprintk("--> %s\n", __func__);
>> +     put_layout_hdr(lgp->args.inode);
>> +     if (lgp->res.layout.buf != NULL)
>> +             free_page((unsigned long) lgp->res.layout.buf);
>> +     put_nfs_open_context(lgp->args.ctx);
>> +     kfree(calldata);
>> +     dprintk("<-- %s\n", __func__);
>> +}
>> +
>> +static const struct rpc_call_ops nfs4_layoutget_call_ops = {
>> +     .rpc_call_prepare = nfs4_layoutget_prepare,
>> +     .rpc_call_done = nfs4_layoutget_done,
>> +     .rpc_release = nfs4_layoutget_release,
>> +};
>> +
>> +static int _nfs4_proc_layoutget(struct nfs4_layoutget *lgp)
>> +{
>> +     struct nfs_server *server = NFS_SERVER(lgp->args.inode);
>> +     struct rpc_task *task;
>> +     struct rpc_message msg = {
>> +             .rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_LAYOUTGET],
>> +             .rpc_argp = &lgp->args,
>> +             .rpc_resp = &lgp->res,
>> +     };
>> +     struct rpc_task_setup task_setup_data = {
>> +             .rpc_client = server->client,
>> +             .rpc_message = &msg,
>> +             .callback_ops = &nfs4_layoutget_call_ops,
>> +             .callback_data = lgp,
>> +             .flags = RPC_TASK_ASYNC,
>> +     };
>> +     int status = 0;
>> +
>> +     dprintk("--> %s\n", __func__);
>> +
>> +     lgp->res.layout.buf = (void *)__get_free_page(GFP_NOFS);
>> +     if (lgp->res.layout.buf == NULL) {
>> +             nfs4_layoutget_release(lgp);
>> +             return -ENOMEM;
>> +     }
>> +
>> +     lgp->res.seq_res.sr_slotid = NFS4_MAX_SLOT_TABLE;
>> +     task = rpc_run_task(&task_setup_data);
>> +     if (IS_ERR(task))
>> +             return PTR_ERR(task);
>> +     status = nfs4_wait_for_completion_rpc_task(task);
>> +     if (status != 0)
>> +             goto out;
>> +     status = lgp->status;
>> +     if (status != 0)
>> +             goto out;
>> +     status = pnfs_layout_process(lgp);
>> +out:
>> +     rpc_put_task(task);
>> +     dprintk("<-- %s status=%d\n", __func__, status);
>> +     return status;
>> +}
>> +
>> +int nfs4_proc_layoutget(struct nfs4_layoutget *lgp)
>> +{
>> +     struct nfs_server *server = NFS_SERVER(lgp->args.inode);
>> +     struct nfs4_exception exception = { };
>> +     int err;
>> +     do {
>> +             err = nfs4_handle_exception(server, _nfs4_proc_layoutget(lgp),
>> +                                         &exception);
>> +     } while (exception.retry);
>> +     return err;
>> +}
>
> Since nfs4_layoutget_done() already calls nfs4_async_handle_error(), do
> you really need to call nfs4_handle_exception()?
>


Hmmm, since it is being called synchronously at the moment, we should
probably remove the nfs4_async_handle_error call.


>> +
>> +int nfs4_proc_getdeviceinfo(struct nfs_server *server, struct pnfs_device *pdev)
>> +{
>> +     struct nfs4_getdeviceinfo_args args = {
>> +             .pdev = pdev,
>> +     };
>> +     struct nfs4_getdeviceinfo_res res = {
>> +             .pdev = pdev,
>> +     };
>> +     struct rpc_message msg = {
>> +             .rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_GETDEVICEINFO],
>> +             .rpc_argp = &args,
>> +             .rpc_resp = &res,
>> +     };
>> +     int status;
>> +
>> +     dprintk("--> %s\n", __func__);
>> +     status = nfs4_call_sync(server, &msg, &args, &res, 0);
>> +     dprintk("<-- %s status=%d\n", __func__, status);
>> +
>> +     return status;
>> +}
>> +EXPORT_SYMBOL_GPL(nfs4_proc_getdeviceinfo);
>> +
>
> This, on the other hand, might need a 'handle exception' wrapper.

I agree.
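
For illustration, the general shape of such a wrapper is a loop that repeats the synchronous call for as long as the handler reports a transient failure. The following is a userspace sketch under stated assumptions, not the eventual kernel code: exception_state, handle_exception, do_getdeviceinfo_once and getdeviceinfo_with_retry are hypothetical stand-ins, and the mock handler only treats -EAGAIN as retryable, whereas the real nfs4_handle_exception() deals with many more NFSv4 error cases.

```c
#include <errno.h>

/* Hypothetical stand-in for struct nfs4_exception: just the retry flag. */
struct exception_state {
	int retry;
};

/*
 * Hypothetical stand-in for nfs4_handle_exception(): treat -EAGAIN as
 * transient (ask the caller to retry) and pass everything else through.
 */
static int handle_exception(int err, struct exception_state *exc)
{
	exc->retry = 0;
	if (err == -EAGAIN) {
		exc->retry = 1;
		return 0;
	}
	return err;
}

/* Mock synchronous call: fails with -EAGAIN until *attempts_left reaches 0. */
static int do_getdeviceinfo_once(int *attempts_left)
{
	if (*attempts_left > 0) {
		(*attempts_left)--;
		return -EAGAIN;
	}
	return 0;
}

/* The wrapper: repeat the call while the handler asks for a retry. */
static int getdeviceinfo_with_retry(int *attempts_left)
{
	struct exception_state exc = { 0 };
	int err;

	do {
		err = handle_exception(do_getdeviceinfo_once(attempts_left),
				       &exc);
	} while (exc.retry);
	return err;
}
```

This is the same do/while structure that nfs4_proc_layoutget() uses in the quoted patch, applied to a call that runs synchronously.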


>
>>  #endif /* CONFIG_NFS_V4_1 */
>>
>>  struct nfs4_state_recovery_ops nfs40_reboot_recovery_ops = {
>> diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
>> index 60233ae..aaf6fe5 100644
>> --- a/fs/nfs/nfs4xdr.c
>> +++ b/fs/nfs/nfs4xdr.c
>> @@ -52,6 +52,7 @@
>>  #include <linux/nfs_idmap.h>
>>  #include "nfs4_fs.h"
>>  #include "internal.h"
>> +#include "pnfs.h"
>>
>>  #define NFSDBG_FACILITY              NFSDBG_XDR
>>
>> @@ -310,6 +311,19 @@ static int nfs4_stat_to_errno(int);
>>                               XDR_QUADLEN(NFS4_MAX_SESSIONID_LEN) + 5)
>>  #define encode_reclaim_complete_maxsz        (op_encode_hdr_maxsz + 4)
>>  #define decode_reclaim_complete_maxsz        (op_decode_hdr_maxsz + 4)
>> +#define encode_getdeviceinfo_maxsz (op_encode_hdr_maxsz + 4 + \
>> +                             XDR_QUADLEN(NFS4_PNFS_DEVICEID4_SIZE))
>> +#define decode_getdeviceinfo_maxsz (op_decode_hdr_maxsz + \
>> +                             1 /* layout type */ + \
>> +                             1 /* opaque devaddr4 length */ + \
>> +                               /* devaddr4 payload is read into page */ \
>> +                             1 /* notification bitmap length */ + \
>> +                             1 /* notification bitmap */)
>> +#define encode_layoutget_maxsz       (op_encode_hdr_maxsz + 10 + \
>> +                             encode_stateid_maxsz)
>> +#define decode_layoutget_maxsz       (op_decode_hdr_maxsz + 8 + \
>> +                             decode_stateid_maxsz + \
>> +                             XDR_QUADLEN(PNFS_LAYOUT_MAXSIZE))
>>  #else /* CONFIG_NFS_V4_1 */
>>  #define encode_sequence_maxsz        0
>>  #define decode_sequence_maxsz        0
>> @@ -699,6 +713,20 @@ static int nfs4_stat_to_errno(int);
>>  #define NFS4_dec_reclaim_complete_sz (compound_decode_hdr_maxsz + \
>>                                        decode_sequence_maxsz + \
>>                                        decode_reclaim_complete_maxsz)
>> +#define NFS4_enc_getdeviceinfo_sz (compound_encode_hdr_maxsz +    \
>> +                             encode_sequence_maxsz +\
>> +                             encode_getdeviceinfo_maxsz)
>> +#define NFS4_dec_getdeviceinfo_sz (compound_decode_hdr_maxsz +    \
>> +                             decode_sequence_maxsz + \
>> +                             decode_getdeviceinfo_maxsz)
>> +#define NFS4_enc_layoutget_sz        (compound_encode_hdr_maxsz + \
>> +                             encode_sequence_maxsz + \
>> +                             encode_putfh_maxsz +        \
>> +                             encode_layoutget_maxsz)
>> +#define NFS4_dec_layoutget_sz        (compound_decode_hdr_maxsz + \
>> +                             decode_sequence_maxsz + \
>> +                             decode_putfh_maxsz +        \
>> +                             decode_layoutget_maxsz)
>>
>>  const u32 nfs41_maxwrite_overhead = ((RPC_MAX_HEADER_WITH_AUTH +
>>                                     compound_encode_hdr_maxsz +
>> @@ -1726,6 +1754,61 @@ static void encode_sequence(struct xdr_stream *xdr,
>>  #endif /* CONFIG_NFS_V4_1 */
>>  }
>>
>> +#ifdef CONFIG_NFS_V4_1
>> +static void
>> +encode_getdeviceinfo(struct xdr_stream *xdr,
>> +                  const struct nfs4_getdeviceinfo_args *args,
>> +                  struct compound_hdr *hdr)
>> +{
>> +     int has_bitmap = (args->pdev->dev_notify_types != 0);
>> +     int len = 16 + NFS4_PNFS_DEVICEID4_SIZE + (has_bitmap * 4);
>> +     __be32 *p;
>> +
>> +     p = reserve_space(xdr, len);
>> +     *p++ = cpu_to_be32(OP_GETDEVICEINFO);
>> +     p = xdr_encode_opaque_fixed(p, args->pdev->dev_id.data,
>> +                                 NFS4_PNFS_DEVICEID4_SIZE);
>> +     *p++ = cpu_to_be32(args->pdev->layout_type);
>> +     *p++ = cpu_to_be32(args->pdev->pglen);          /* gdia_maxcount */
>> +     *p++ = cpu_to_be32(has_bitmap);                 /* bitmap length [01] */
>> +     if (has_bitmap)
>> +             *p = cpu_to_be32(args->pdev->dev_notify_types);
>
> We don't support notification callbacks yet.
>

OK, I'll rip this out and just set the bitmap to zero.
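
As a rough picture of the simplified encoding, here is a userspace sketch (not the revised patch) that writes the GETDEVICEINFO operation with a zero-length notification bitmap. The helper names put_be32 and encode_getdeviceinfo_nobitmap are hypothetical, and the opcode value 47 is taken from RFC 5661. With no bitmap word the operation is always 16 + NFS4_PNFS_DEVICEID4_SIZE = 32 bytes, which matches the has_bitmap == 0 case of the length computed in the quoted code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEVICEID_SIZE 16	/* NFS4_PNFS_DEVICEID4_SIZE in the patch */
#define OP_GETDEVICEINFO 47	/* opcode value from RFC 5661 */

/* Store a 32-bit value in network (big-endian) byte order. */
static uint8_t *put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
	return p + 4;
}

/*
 * Encode the GETDEVICEINFO operation with an empty notification bitmap.
 * Returns the number of bytes written, or 0 if buf is too small.
 */
static size_t encode_getdeviceinfo_nobitmap(uint8_t *buf, size_t buflen,
					    const uint8_t devid[DEVICEID_SIZE],
					    uint32_t layout_type,
					    uint32_t maxcount)
{
	const size_t len = 16 + DEVICEID_SIZE;	/* four 32-bit words + deviceid */
	uint8_t *p = buf;

	if (buflen < len)
		return 0;
	p = put_be32(p, OP_GETDEVICEINFO);
	memcpy(p, devid, DEVICEID_SIZE);	/* fixed-size opaque, no length word */
	p += DEVICEID_SIZE;
	p = put_be32(p, layout_type);
	p = put_be32(p, maxcount);		/* gdia_maxcount */
	p = put_be32(p, 0);			/* bitmap length: no notifications */
	return (size_t)(p - buf);
}
```

Because the device id is a fixed-size opaque, it is copied without a length prefix, and the only variable-length field left (the bitmap) collapses to a single zero word.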

>> +     hdr->nops++;
>> +     hdr->replen += decode_getdeviceinfo_maxsz;
>> +}
>> +
>> +static void
>> +encode_layoutget(struct xdr_stream *xdr,
>> +                   const struct nfs4_layoutget_args *args,
>> +                   struct compound_hdr *hdr)
>> +{
>> +     nfs4_stateid stateid;
>> +     __be32 *p;
>> +
>> +     p = reserve_space(xdr, 44 + NFS4_STATEID_SIZE);
>> +     *p++ = cpu_to_be32(OP_LAYOUTGET);
>> +     *p++ = cpu_to_be32(0);     /* Signal layout available */
>> +     *p++ = cpu_to_be32(args->type);
>> +     *p++ = cpu_to_be32(args->range.iomode);
>> +     p = xdr_encode_hyper(p, args->range.offset);
>> +     p = xdr_encode_hyper(p, args->range.length);
>> +     p = xdr_encode_hyper(p, args->minlength);
>> +     pnfs_get_layout_stateid(&stateid, NFS_I(args->inode)->layout);
>> +     p = xdr_encode_opaque_fixed(p, &stateid.data, NFS4_STATEID_SIZE);
>> +     *p = cpu_to_be32(args->maxcount);
>> +
>> +     dprintk("%s: 1st type:0x%x iomode:%d off:%lu len:%lu mc:%d\n",
>> +             __func__,
>> +             args->type,
>> +             args->range.iomode,
>> +             (unsigned long)args->range.offset,
>> +             (unsigned long)args->range.length,
>> +             args->maxcount);
>> +     hdr->nops++;
>> +     hdr->replen += decode_layoutget_maxsz;
>> +}
>> +#endif /* CONFIG_NFS_V4_1 */
>> +
>>  /*
>>   * END OF "GENERIC" ENCODE ROUTINES.
>>   */
>> @@ -2543,6 +2626,51 @@ static int nfs4_xdr_enc_reclaim_complete(struct rpc_rqst *req, uint32_t *p,
>>       return 0;
>>  }
>>
>> +/*
>> + * Encode GETDEVICEINFO request
>> + */
>> +static int nfs4_xdr_enc_getdeviceinfo(struct rpc_rqst *req, uint32_t *p,
>> +                                   struct nfs4_getdeviceinfo_args *args)
>> +{
>> +     struct xdr_stream xdr;
>> +     struct compound_hdr hdr = {
>> +             .minorversion = nfs4_xdr_minorversion(&args->seq_args),
>> +     };
>> +
>> +     xdr_init_encode(&xdr, &req->rq_snd_buf, p);
>> +     encode_compound_hdr(&xdr, req, &hdr);
>> +     encode_sequence(&xdr, &args->seq_args, &hdr);
>> +     encode_getdeviceinfo(&xdr, args, &hdr);
>> +
>> +     /* set up reply kvec. Subtract notification bitmap max size (2)
>> +      * so that notification bitmap is put in xdr_buf tail */
>> +     xdr_inline_pages(&req->rq_rcv_buf, (hdr.replen - 2) << 2,
>> +                      args->pdev->pages, args->pdev->pgbase,
>> +                      args->pdev->pglen);
>> +
>> +     encode_nops(&hdr);
>> +     return 0;
>> +}
>> +
>> +/*
>> + *  Encode LAYOUTGET request
>> + */
>> +static int nfs4_xdr_enc_layoutget(struct rpc_rqst *req, uint32_t *p,
>> +                               struct nfs4_layoutget_args *args)
>> +{
>> +     struct xdr_stream xdr;
>> +     struct compound_hdr hdr = {
>> +             .minorversion = nfs4_xdr_minorversion(&args->seq_args),
>> +     };
>> +
>> +     xdr_init_encode(&xdr, &req->rq_snd_buf, p);
>> +     encode_compound_hdr(&xdr, req, &hdr);
>> +     encode_sequence(&xdr, &args->seq_args, &hdr);
>> +     encode_putfh(&xdr, NFS_FH(args->inode), &hdr);
>> +     encode_layoutget(&xdr, args, &hdr);
>> +     encode_nops(&hdr);
>> +     return 0;
>> +}
>>  #endif /* CONFIG_NFS_V4_1 */
>>
>>  static void print_overflow_msg(const char *func, const struct xdr_stream *xdr)
>> @@ -4788,6 +4916,131 @@ out_overflow:
>>  #endif /* CONFIG_NFS_V4_1 */
>>  }
>>
>> +#if defined(CONFIG_NFS_V4_1)
>> +
>> +static int decode_getdeviceinfo(struct xdr_stream *xdr,
>> +                             struct pnfs_device *pdev)
>> +{
>> +     __be32 *p;
>> +     uint32_t len, type;
>> +     int status;
>> +
>> +     status = decode_op_hdr(xdr, OP_GETDEVICEINFO);
>> +     if (status) {
>> +             if (status == -ETOOSMALL) {
>> +                     p = xdr_inline_decode(xdr, 4);
>> +                     if (unlikely(!p))
>> +                             goto out_overflow;
>> +                     pdev->mincount = be32_to_cpup(p);
>> +                     dprintk("%s: Min count too small. mincnt = %u\n",
>> +                             __func__, pdev->mincount);
>> +             }
>> +             return status;
>> +     }
>> +
>> +     p = xdr_inline_decode(xdr, 8);
>> +     if (unlikely(!p))
>> +             goto out_overflow;
>> +     type = be32_to_cpup(p++);
>> +     if (type != pdev->layout_type) {
>> +             dprintk("%s: layout mismatch req: %u pdev: %u\n",
>> +                     __func__, pdev->layout_type, type);
>> +             return -EINVAL;
>> +     }
>> +     /*
>> +      * Get the length of the opaque device_addr4. xdr_read_pages places
>> +      * the opaque device_addr4 in the xdr_buf->pages (pnfs_device->pages)
>> +      * and places the remaining xdr data in xdr_buf->tail
>> +      */
>> +     pdev->mincount = be32_to_cpup(p);
>> +     xdr_read_pages(xdr, pdev->mincount); /* include space for the length */
>> +
>> +     /*
>> +      * At most one bitmap word. If the server returns a bitmap of more
>> +      * than one word we ignore the extra invalid words given that
>> +      * getdeviceinfo is the final operation in the compound.
>> +      */
>> +     p = xdr_inline_decode(xdr, 4);
>> +     if (unlikely(!p))
>> +             goto out_overflow;
>> +     len = be32_to_cpup(p);
>> +     if (len) {
>> +             p = xdr_inline_decode(xdr, 4);
>> +             if (unlikely(!p))
>> +                     goto out_overflow;
>> +             pdev->dev_notify_types = be32_to_cpup(p);
>> +     } else
>> +             pdev->dev_notify_types = 0;
>
> Again, we don't support notifications.
>

OK.


>> +     return 0;
>> +out_overflow:
>> +     print_overflow_msg(__func__, xdr);
>> +     return -EIO;
>> +}
>> +
>> +static int decode_layoutget(struct xdr_stream *xdr, struct rpc_rqst *req,
>> +                         struct nfs4_layoutget_res *res)
>> +{
>> +     __be32 *p;
>> +     int status;
>> +     u32 layout_count;
>> +
>> +     status = decode_op_hdr(xdr, OP_LAYOUTGET);
>> +     if (status)
>> +             return status;
>> +     p = xdr_inline_decode(xdr, 8 + NFS4_STATEID_SIZE);
>> +     if (unlikely(!p))
>> +             goto out_overflow;
>> +     res->return_on_close = be32_to_cpup(p++);
>> +     p = xdr_decode_opaque_fixed(p, res->stateid.data, NFS4_STATEID_SIZE);
>> +     layout_count = be32_to_cpup(p);
>> +     if (!layout_count) {
>> +             dprintk("%s: server responded with empty layout array\n",
>> +                     __func__);
>> +             return -EINVAL;
>> +     }
>> +
>> +     p = xdr_inline_decode(xdr, 24);
>> +     if (unlikely(!p))
>> +             goto out_overflow;
>> +     p = xdr_decode_hyper(p, &res->range.offset);
>> +     p = xdr_decode_hyper(p, &res->range.length);
>> +     res->range.iomode = be32_to_cpup(p++);
>> +     res->type = be32_to_cpup(p++);
>> +
>> +     status = decode_opaque_inline(xdr, &res->layout.len, (char **)&p);
>> +     if (unlikely(status))
>> +             return status;
>> +
>> +     dprintk("%s roff:%lu rlen:%lu riomode:%d, lo_type:0x%x, lo.len:%d\n",
>> +             __func__,
>> +             (unsigned long)res->range.offset,
>> +             (unsigned long)res->range.length,
>> +             res->range.iomode,
>> +             res->type,
>> +             res->layout.len);
>> +
>> +     /* nfs4_proc_layoutget allocated a single page */
>> +     if (res->layout.len > PAGE_SIZE)
>> +             return -ENOMEM;
>> +     memcpy(res->layout.buf, p, res->layout.len);
>> +
>> +     if (layout_count > 1) {
>> +             /* We only handle a length one array at the moment.  Any
>> +              * further entries are just ignored.  Note that this means
>> +              * the client may see a response that is less than the
>> +              * minimum it requested.
>> +              */
>> +             dprintk("%s: server responded with %d layouts, dropping tail\n",
>> +                     __func__, layout_count);
>> +     }
>> +
>> +     return 0;
>> +out_overflow:
>> +     print_overflow_msg(__func__, xdr);
>> +     return -EIO;
>> +}
>> +#endif /* CONFIG_NFS_V4_1 */
>> +
>>  /*
>>   * END OF "GENERIC" DECODE ROUTINES.
>>   */
>> @@ -5815,6 +6068,53 @@ static int nfs4_xdr_dec_reclaim_complete(struct rpc_rqst *rqstp, uint32_t *p,
>>               status = decode_reclaim_complete(&xdr, (void *)NULL);
>>       return status;
>>  }
>> +
>> +/*
>> + * Decode GETDEVINFO response
>> + */
>> +static int nfs4_xdr_dec_getdeviceinfo(struct rpc_rqst *rqstp, uint32_t *p,
>> +                                   struct nfs4_getdeviceinfo_res *res)
>> +{
>> +     struct xdr_stream xdr;
>> +     struct compound_hdr hdr;
>> +     int status;
>> +
>> +     xdr_init_decode(&xdr, &rqstp->rq_rcv_buf, p);
>> +     status = decode_compound_hdr(&xdr, &hdr);
>> +     if (status != 0)
>> +             goto out;
>> +     status = decode_sequence(&xdr, &res->seq_res, rqstp);
>> +     if (status != 0)
>> +             goto out;
>> +     status = decode_getdeviceinfo(&xdr, res->pdev);
>> +out:
>> +     return status;
>> +}
>> +
>> +/*
>> + * Decode LAYOUTGET response
>> + */
>> +static int nfs4_xdr_dec_layoutget(struct rpc_rqst *rqstp, uint32_t *p,
>> +                               struct nfs4_layoutget_res *res)
>> +{
>> +     struct xdr_stream xdr;
>> +     struct compound_hdr hdr;
>> +     int status;
>> +
>> +     xdr_init_decode(&xdr, &rqstp->rq_rcv_buf, p);
>> +     status = decode_compound_hdr(&xdr, &hdr);
>> +     if (status)
>> +             goto out;
>> +     status = decode_sequence(&xdr, &res->seq_res, rqstp);
>> +     if (status)
>> +             goto out;
>> +     status = decode_putfh(&xdr);
>> +     if (status)
>> +             goto out;
>> +     status = decode_layoutget(&xdr, rqstp, res);
>> +out:
>> +     return status;
>> +}
>>  #endif /* CONFIG_NFS_V4_1 */
>>
>>  __be32 *nfs4_decode_dirent(__be32 *p, struct nfs_entry *entry, int plus)
>> @@ -5993,6 +6293,8 @@ struct rpc_procinfo     nfs4_procedures[] = {
>>    PROC(SEQUENCE,     enc_sequence,   dec_sequence),
>>    PROC(GET_LEASE_TIME,       enc_get_lease_time,     dec_get_lease_time),
>>    PROC(RECLAIM_COMPLETE, enc_reclaim_complete,  dec_reclaim_complete),
>> +  PROC(GETDEVICEINFO, enc_getdeviceinfo, dec_getdeviceinfo),
>> +  PROC(LAYOUTGET,  enc_layoutget,     dec_layoutget),
>>  #endif /* CONFIG_NFS_V4_1 */
>>  };
>>
>> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>> index cbce942..faf6c4c 100644
>> --- a/fs/nfs/pnfs.c
>> +++ b/fs/nfs/pnfs.c
>> @@ -128,6 +128,12 @@ pnfs_register_layoutdriver(struct pnfs_layoutdriver_type *ld_type)
>>               return status;
>>       }
>>
>> +     if (!io_ops->alloc_lseg || !io_ops->free_lseg) {
>> +             printk(KERN_ERR "%s Layout driver must provide "
>> +                    "alloc_lseg and free_lseg.\n", __func__);
>> +             return status;
>> +     }
>> +
>>       spin_lock(&pnfs_spinlock);
>>       if (!find_pnfs_driver_locked(ld_type->id)) {
>>               list_add(&ld_type->pnfs_tblid, &pnfs_modules_tbl);
>> @@ -153,6 +159,10 @@ pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *ld_type)
>>  }
>>  EXPORT_SYMBOL(pnfs_unregister_layoutdriver);
>>
>> +/*
>> + * pNFS client layout cache
>> + */
>> +
>>  static void
>>  get_layout_hdr_locked(struct pnfs_layout_hdr *lo)
>>  {
>> @@ -175,6 +185,15 @@ put_layout_hdr_locked(struct pnfs_layout_hdr *lo)
>>       }
>>  }
>>
>> +void
>> +put_layout_hdr(struct inode *inode)
>> +{
>> +     spin_lock(&inode->i_lock);
>> +     put_layout_hdr_locked(NFS_I(inode)->layout);
>> +     spin_unlock(&inode->i_lock);
>> +
>> +}
>> +
>>  static void
>>  init_lseg(struct pnfs_layout_hdr *lo, struct pnfs_layout_segment *lseg)
>>  {
>> @@ -191,7 +210,7 @@ destroy_lseg(struct kref *kref)
>>       struct pnfs_layout_hdr *local = lseg->layout;
>>
>>       dprintk("--> %s\n", __func__);
>> -     kfree(lseg);
>> +     PNFS_LD_IO_OPS(local)->free_lseg(lseg);
>
> Where is PNFS_LD_IO_OPS() defined? Besides, I thought we agreed to get
> rid of that.

This is defined in pnfs.h as
PNFS_NFS_SERVER()->pnfs_curr_ld->ld_io_ops, mainly to save typing.

The macro that you had objected to was PNFS_EXISTS_LDIO_OP from
Benny's tree, which is now gone.

>
>>       /* Matched by get_layout_hdr_locked in pnfs_insert_layout */
>>       put_layout_hdr_locked(local);
>>  }
>> @@ -226,6 +245,7 @@ pnfs_clear_lseg_list(struct pnfs_layout_hdr *lo)
>>       /* List does not take a reference, so no need for put here */
>>       list_del_init(&lo->layouts);
>>       spin_unlock(&clp->cl_lock);
>> +     pnfs_set_layout_stateid(lo, &zero_stateid);
>>
>>       dprintk("%s:Return\n", __func__);
>>  }
>> @@ -268,40 +288,120 @@ pnfs_destroy_all_layouts(struct nfs_client *clp)
>>       }
>>  }
>>
>> -static void pnfs_insert_layout(struct pnfs_layout_hdr *lo,
>> -                            struct pnfs_layout_segment *lseg);
>> +void
>> +pnfs_set_layout_stateid(struct pnfs_layout_hdr *lo,
>> +                     const nfs4_stateid *stateid)
>> +{
>> +     write_seqlock(&lo->seqlock);
>> +     memcpy(lo->stateid.data, stateid->data, sizeof(lo->stateid.data));
>> +     write_sequnlock(&lo->seqlock);
>> +}
>> +
>> +void
>> +pnfs_get_layout_stateid(nfs4_stateid *dst, struct pnfs_layout_hdr *lo)
>> +{
>> +     int seq;
>>
>> -/* Get layout from server. */
>> +     dprintk("--> %s\n", __func__);
>> +
>> +     do {
>> +             seq = read_seqbegin(&lo->seqlock);
>> +             memcpy(dst->data, lo->stateid.data,
>> +                    sizeof(lo->stateid.data));
>> +     } while (read_seqretry(&lo->seqlock, seq));
>> +
>> +     dprintk("<-- %s\n", __func__);
>> +}
>> +
>> +static void
>> +pnfs_layout_from_open_stateid(struct pnfs_layout_hdr *lo,
>> +                           struct nfs4_state *state)
>> +{
>> +     int seq;
>> +
>> +     dprintk("--> %s\n", __func__);
>> +
>> +     write_seqlock(&lo->seqlock);
>> +     /* Zero stateid, which is illegal to use in layout, is our
>> +      * marker for an un-initialized stateid.
>> +      */
>
> Isn't it easier just to have a flag in the layout?
>
>> +     if (!memcmp(lo->stateid.data, &zero_stateid, NFS4_STATEID_SIZE))
>> +             do {
>> +                     seq = read_seqbegin(&state->seqlock);
>> +                     memcpy(lo->stateid.data, state->stateid.data,
>> +                                     sizeof(state->stateid.data));
>> +             } while (read_seqretry(&state->seqlock, seq));
>> +     write_sequnlock(&lo->seqlock);
>
> ...and if memcmp(), is the caller supposed to detect that nothing was
> done?
>
>> +     dprintk("<-- %s\n", __func__);
>> +}
>> +
>> +/*
>> +* Get layout from server.
>> +*    for now, assume that whole file layouts are requested.
>> +*    arg->offset: 0
>> +*    arg->length: all ones
>> +*/
>>  static struct pnfs_layout_segment *
>>  send_layoutget(struct pnfs_layout_hdr *lo,
>>          struct nfs_open_context *ctx,
>>          u32 iomode)
>>  {
>>       struct inode *ino = lo->inode;
>> -     struct pnfs_layout_segment *lseg;
>> +     struct nfs_server *server = NFS_SERVER(ino);
>> +     struct nfs4_layoutget *lgp;
>> +     struct pnfs_layout_segment *lseg = NULL;
>>
>> -     /* Lets pretend we sent LAYOUTGET and got a response */
>> -     lseg = kzalloc(sizeof(*lseg), GFP_KERNEL);
>> +     dprintk("--> %s\n", __func__);
>> +
>> +     BUG_ON(ctx == NULL);
>> +     lgp = kzalloc(sizeof(*lgp), GFP_KERNEL);
>> +     if (lgp == NULL) {
>> +             put_layout_hdr(lo->inode);
>> +             return NULL;
>> +     }
>> +     lgp->args.minlength = NFS4_MAX_UINT64;
>> +     lgp->args.maxcount = PNFS_LAYOUT_MAXSIZE;
>> +     lgp->args.range.iomode = iomode;
>> +     lgp->args.range.offset = 0;
>> +     lgp->args.range.length = NFS4_MAX_UINT64;
>> +     lgp->args.type = server->pnfs_curr_ld->id;
>> +     lgp->args.inode = ino;
>> +     lgp->args.ctx = get_nfs_open_context(ctx);
>> +     lgp->lsegpp = &lseg;
>> +
>> +     if (!memcmp(lo->stateid.data, &zero_stateid, NFS4_STATEID_SIZE))
>> +             pnfs_layout_from_open_stateid(NFS_I(ino)->layout, ctx->state);
>
> Why do an extra memcmp() here?

OK, clearly the function and call to pnfs_layout_from_open_stateid
need to be reexamined.
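
One way to read the "flag in the layout" suggestion is sketched below in
userspace C. This is only an illustration under assumptions: struct
demo_layout, DEMO_STATEID_SIZE and demo_layout_init_stateid() are
hypothetical names that do not appear in the patch, and all locking
(the seqlocks in the real code) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_STATEID_SIZE 16	/* mirrors NFS4_STATEID_SIZE; assumption for this sketch */

/* Hypothetical stand-in for the stateid part of the layout header. */
struct demo_layout {
	unsigned char stateid[DEMO_STATEID_SIZE];
	bool stateid_set;	/* explicit marker instead of comparing to a zero stateid */
};

/*
 * Seed the layout stateid from the open stateid, but only once.
 * Returns true if the copy was done and false if a stateid was already
 * set, so the caller can tell whether anything happened.
 */
bool demo_layout_init_stateid(struct demo_layout *lo,
			      const unsigned char *open_stateid)
{
	assert(lo != NULL && open_stateid != NULL);
	if (lo->stateid_set)
		return false;
	memcpy(lo->stateid, open_stateid, DEMO_STATEID_SIZE);
	lo->stateid_set = true;
	return true;
}
```

With a flag, the caller no longer needs its own memcmp() before the
call, and the return value answers the "was anything done?" question.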

Fred

>
>> +
>> +     /* Synchronously retrieve layout information from server and
>> +      * store in lseg.
>> +      */
>> +     nfs4_proc_layoutget(lgp);
>>       if (!lseg) {
>> +             /* remember that LAYOUTGET failed and suspend trying */
>>               set_bit(lo_fail_bit(iomode), &lo->state);
>> -             spin_lock(&ino->i_lock);
>> -             put_layout_hdr_locked(lo);
>> -             spin_unlock(&ino->i_lock);
>> -             return NULL;
>>       }
>> -     init_lseg(lo, lseg);
>> -     lseg->iomode = IOMODE_RW;
>> -     spin_lock(&ino->i_lock);
>> -     pnfs_insert_layout(lo, lseg);
>> -     put_layout_hdr_locked(lo);
>> -     spin_unlock(&ino->i_lock);
>>       return lseg;
>>  }
>>
>> +/*
>> + * Compare two layout segments for sorting into layout cache.
>> + * We want to preferentially return RW over RO layouts, so ensure those
>> + * are seen first.
>> + */
>> +static s64
>> +cmp_layout(u32 iomode1, u32 iomode2)
>> +{
>> +     /* read > read/write */
>> +     return (int)(iomode2 == IOMODE_READ) - (int)(iomode1 == IOMODE_READ);
>> +}
>> +
>>  static void
>>  pnfs_insert_layout(struct pnfs_layout_hdr *lo,
>>                  struct pnfs_layout_segment *lseg)
>>  {
>> +     struct pnfs_layout_segment *lp;
>> +     int found = 0;
>> +
>>       dprintk("%s:Begin\n", __func__);
>>
>>       assert_spin_locked(&lo->inode->i_lock);
>> @@ -313,13 +413,28 @@ pnfs_insert_layout(struct pnfs_layout_hdr *lo,
>>               list_add_tail(&lo->layouts, &clp->cl_layouts);
>>               spin_unlock(&clp->cl_lock);
>>       }
>> -     /* STUB - add the constructed lseg if necessary */
>> -     if (list_empty(&lo->segs)) {
>> +     list_for_each_entry(lp, &lo->segs, fi_list) {
>> +             if (cmp_layout(lp->range.iomode, lseg->range.iomode) > 0)
>> +                     continue;
>> +             list_add_tail(&lseg->fi_list, &lp->fi_list);
>> +             dprintk("%s: inserted lseg %p "
>> +                     "iomode %d offset %llu length %llu before "
>> +                     "lp %p iomode %d offset %llu length %llu\n",
>> +                     __func__, lseg, lseg->range.iomode,
>> +                     lseg->range.offset, lseg->range.length,
>> +                     lp, lp->range.iomode, lp->range.offset,
>> +                     lp->range.length);
>> +             found = 1;
>> +             break;
>> +     }
>> +     if (!found) {
>>               list_add_tail(&lseg->fi_list, &lo->segs);
>> -             get_layout_hdr_locked(lo);
>> -             dprintk("%s: inserted lseg %p iomode %d at tail\n",
>> -                     __func__, lseg, lseg->iomode);
>> +             dprintk("%s: inserted lseg %p "
>> +                     "iomode %d offset %llu length %llu at tail\n",
>> +                     __func__, lseg, lseg->range.iomode,
>> +                     lseg->range.offset, lseg->range.length);
>>       }
>> +     get_layout_hdr_locked(lo);
>>
>>       dprintk("%s:Return\n", __func__);
>>  }
>> @@ -335,6 +450,7 @@ alloc_init_layout_hdr(struct inode *ino)
>>       lo->refcount = 1;
>>       INIT_LIST_HEAD(&lo->layouts);
>>       INIT_LIST_HEAD(&lo->segs);
>> +     seqlock_init(&lo->seqlock);
>>       lo->inode = ino;
>>       return lo;
>>  }
>> @@ -362,11 +478,46 @@ pnfs_find_alloc_layout(struct inode *ino)
>>       return nfsi->layout;
>>  }
>>
>> -/* STUB - LAYOUTGET never succeeds, so cache is empty */
>> +/*
>> + * iomode matching rules:
>> + * iomode    lseg    match
>> + * -----     -----   -----
>> + * ANY               READ    true
>> + * ANY               RW      true
>> + * RW                READ    false
>> + * RW                RW      true
>> + * READ              READ    true
>> + * READ              RW      true
>> + */
>> +static int
>> +is_matching_lseg(struct pnfs_layout_segment *lseg, u32 iomode)
>> +{
>> +     return (iomode != IOMODE_RW || lseg->range.iomode == IOMODE_RW);
>> +}
>> +
>> +/*
>> + * lookup range in layout
>> + */
>>  static struct pnfs_layout_segment *
>>  pnfs_has_layout(struct pnfs_layout_hdr *lo, u32 iomode)
>>  {
>> -     return NULL;
>> +     struct pnfs_layout_segment *lseg, *ret = NULL;
>> +
>> +     dprintk("%s:Begin\n", __func__);
>> +
>> +     assert_spin_locked(&lo->inode->i_lock);
>> +     list_for_each_entry(lseg, &lo->segs, fi_list) {
>> +             if (is_matching_lseg(lseg, iomode)) {
>> +                     ret = lseg;
>> +                     break;
>> +             }
>> +             if (cmp_layout(iomode, lseg->range.iomode) > 0)
>> +                     break;
>> +     }
>> +
>> +     dprintk("%s:Return lseg %p ref %d\n",
>> +             __func__, ret, ret ? atomic_read(&ret->kref.refcount) : 0);
>> +     return ret;
>>  }
>>
>>  /*
>> @@ -403,7 +554,7 @@ pnfs_update_layout(struct inode *ino,
>>       if (test_bit(lo_fail_bit(iomode), &nfsi->layout->state))
>>               goto out_unlock;
>>
>> -     get_layout_hdr_locked(lo);
>> +     get_layout_hdr_locked(lo); /* Matched in nfs4_layoutget_release */
>>       spin_unlock(&ino->i_lock);
>>
>>       lseg = send_layoutget(lo, ctx, iomode);
>> @@ -415,3 +566,184 @@ out_unlock:
>>       spin_unlock(&ino->i_lock);
>>       goto out;
>>  }
>> +
>> +int
>> +pnfs_layout_process(struct nfs4_layoutget *lgp)
>> +{
>> +     struct pnfs_layout_hdr *lo = NFS_I(lgp->args.inode)->layout;
>> +     struct nfs4_layoutget_res *res = &lgp->res;
>> +     struct pnfs_layout_segment *lseg;
>> +     struct inode *ino = lo->inode;
>> +     int status = 0;
>> +
>> +     /* Inject layout blob into I/O device driver */
>> +     lseg = PNFS_LD_IO_OPS(lo)->alloc_lseg(lo, res);
>                 ^^^^^^^^^^^^^^
>
>> +     if (!lseg || IS_ERR(lseg)) {
>> +             if (!lseg)
>> +                     status = -ENOMEM;
>> +             else
>> +                     status = PTR_ERR(lseg);
>> +             dprintk("%s: Could not allocate layout: error %d\n",
>> +                    __func__, status);
>> +             goto out;
>> +     }
>> +
>> +     spin_lock(&ino->i_lock);
>> +     init_lseg(lo, lseg);
>> +     lseg->range = res->range;
>> +     *lgp->lsegpp = lseg;
>> +     pnfs_insert_layout(lo, lseg);
>> +
>> +     /* Done processing layoutget. Set the layout stateid */
>> +     pnfs_set_layout_stateid(lo, &res->stateid);
>> +     spin_unlock(&ino->i_lock);
>> +out:
>> +     return status;
>> +}
>> +
>> +/*
>> + * Device ID cache. Currently supports one layout type per struct nfs_client.
>> + * Add layout type to the lookup key to expand to support multiple types.
>> + */
>> +int
>> +nfs4_alloc_init_deviceid_cache(struct nfs_client *clp,
>> +                      void (*free_callback)(struct nfs4_deviceid *))
>> +{
>> +     struct nfs4_deviceid_cache *c;
>> +
>> +     c = kzalloc(sizeof(struct nfs4_deviceid_cache), GFP_KERNEL);
>> +     if (!c)
>> +             return -ENOMEM;
>> +     spin_lock(&clp->cl_lock);
>> +     if (clp->cl_devid_cache != NULL) {
>> +             atomic_inc(&clp->cl_devid_cache->dc_ref);
>> +             dprintk("%s [kref [%d]]\n", __func__,
>> +                     atomic_read(&clp->cl_devid_cache->dc_ref));
>> +             kfree(c);
>> +     } else {
>> +             /* kzalloc initializes hlists */
>> +             spin_lock_init(&c->dc_lock);
>> +             atomic_set(&c->dc_ref, 1);
>> +             c->dc_free_callback = free_callback;
>> +             clp->cl_devid_cache = c;
>> +             dprintk("%s [new]\n", __func__);
>> +     }
>> +     spin_unlock(&clp->cl_lock);
>> +     return 0;
>> +}
>> +EXPORT_SYMBOL(nfs4_alloc_init_deviceid_cache);
>> +
>> +void
>> +nfs4_init_deviceid_node(struct nfs4_deviceid *d)
>> +{
>> +     INIT_HLIST_NODE(&d->de_node);
>> +     atomic_set(&d->de_ref, 1);
>> +}
>> +EXPORT_SYMBOL(nfs4_init_deviceid_node);
>> +
>> +/* Called from layoutdriver_io_operations->alloc_lseg */
>> +void
>> +nfs4_set_layout_deviceid(struct pnfs_layout_segment *l, struct nfs4_deviceid *d)
>> +{
>> +     dprintk("%s [%d]\n", __func__, atomic_read(&d->de_ref));
>> +     l->deviceid = d;
>> +}
>> +EXPORT_SYMBOL(nfs4_set_layout_deviceid);
>> +
>> +/*
>> + * Called from layoutdriver_io_operations->free_lseg
>> + * last layout segment reference frees deviceid
>> + */
>> +void
>> +nfs4_put_layout_deviceid(struct pnfs_layout_segment *l)
>> +{
>> +     struct nfs4_deviceid_cache *c =
>> +             NFS_SERVER(l->layout->inode)->nfs_client->cl_devid_cache;
>> +     struct pnfs_deviceid *id = &l->deviceid->de_id;
>> +     struct nfs4_deviceid *d;
>> +     struct hlist_node *n;
>> +     long h = nfs4_deviceid_hash(id);
>> +
>> +     dprintk("%s [%d]\n", __func__, atomic_read(&l->deviceid->de_ref));
>> +     if (!atomic_dec_and_lock(&l->deviceid->de_ref, &c->dc_lock))
>> +             return;
>> +
>> +     hlist_for_each_entry_rcu(d, n, &c->dc_deviceids[h], de_node)
>> +             if (!memcmp(&d->de_id, id, sizeof(*id))) {
>> +                     hlist_del_rcu(&d->de_node);
>> +                     spin_unlock(&c->dc_lock);
>> +                     synchronize_rcu();
>> +                     c->dc_free_callback(l->deviceid);
>> +                     return;
>> +             }
>> +     spin_unlock(&c->dc_lock);
>> +}
>> +EXPORT_SYMBOL(nfs4_put_layout_deviceid);
>> +
>> +/* Find and reference a deviceid */
>> +struct nfs4_deviceid *
>> +nfs4_find_get_deviceid(struct nfs4_deviceid_cache *c, struct pnfs_deviceid *id)
>> +{
>> +     struct nfs4_deviceid *d;
>> +     struct hlist_node *n;
>> +     long hash = nfs4_deviceid_hash(id);
>> +
>> +     dprintk("--> %s hash %ld\n", __func__, hash);
>> +     rcu_read_lock();
>> +     hlist_for_each_entry_rcu(d, n, &c->dc_deviceids[hash], de_node) {
>> +             if (!memcmp(&d->de_id, id, sizeof(*id))) {
>> +                     if (!atomic_inc_not_zero(&d->de_ref)) {
>> +                             goto fail;
>> +                     } else {
>> +                             rcu_read_unlock();
>> +                             return d;
>> +                     }
>> +             }
>> +     }
>> +fail:
>> +     rcu_read_unlock();
>> +     return NULL;
>> +}
>> +EXPORT_SYMBOL(nfs4_find_get_deviceid);
>> +
>> +/*
>> + * Add a deviceid to the cache.
>> + * GETDEVICEINFOs for same deviceid can race. If deviceid is found, discard new
>> + */
>> +struct nfs4_deviceid *
>> +nfs4_add_deviceid(struct nfs4_deviceid_cache *c, struct nfs4_deviceid *new)
>> +{
>> +     struct nfs4_deviceid *d;
>> +     struct hlist_node *n;
>> +     long hash = nfs4_deviceid_hash(&new->de_id);
>> +
>> +     dprintk("--> %s hash %ld\n", __func__, hash);
>> +     spin_lock(&c->dc_lock);
>> +     hlist_for_each_entry_rcu(d, n, &c->dc_deviceids[hash], de_node) {
>> +             if (!memcmp(&d->de_id, &new->de_id, sizeof(new->de_id))) {
>> +                     spin_unlock(&c->dc_lock);
>> +                     dprintk("%s [discard]\n", __func__);
>> +                     c->dc_free_callback(new);
>> +                     return d;
>> +             }
>> +     }
>> +     hlist_add_head_rcu(&new->de_node, &c->dc_deviceids[hash]);
>> +     spin_unlock(&c->dc_lock);
>> +     dprintk("%s [new]\n", __func__);
>> +     return new;
>> +}
>> +EXPORT_SYMBOL(nfs4_add_deviceid);
>> +
>> +void
>> +nfs4_put_deviceid_cache(struct nfs_client *clp)
>> +{
>> +     struct nfs4_deviceid_cache *local = clp->cl_devid_cache;
>> +
>> +     dprintk("--> %s cl_devid_cache %p\n", __func__, clp->cl_devid_cache);
>> +     if (atomic_dec_and_lock(&local->dc_ref, &clp->cl_lock)) {
>> +             clp->cl_devid_cache = NULL;
>> +             spin_unlock(&clp->cl_lock);
>> +             kfree(local);
>> +     }
>> +}
>> +EXPORT_SYMBOL(nfs4_put_deviceid_cache);
>> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
>> index dac6a72..d343f83 100644
>> --- a/fs/nfs/pnfs.h
>> +++ b/fs/nfs/pnfs.h
>> @@ -12,11 +12,14 @@
>>
>>  struct pnfs_layout_segment {
>>       struct list_head fi_list;
>> -     u32 iomode;
>> +     struct pnfs_layout_range range;
>>       struct kref kref;
>>       struct pnfs_layout_hdr *layout;
>> +     struct nfs4_deviceid *deviceid;
>>  };
>>
>> +#define NFS4_PNFS_DEVICEID4_SIZE 16
>> +
>>  #ifdef CONFIG_NFS_V4_1
>>
>>  #define LAYOUT_NFSV4_1_MODULE_PREFIX "nfs-layouttype4"
>> @@ -38,17 +41,86 @@ struct pnfs_layout_hdr {
>>       int                     refcount;
>>       struct list_head        layouts;   /* other client layouts */
>>       struct list_head        segs;      /* layout segments list */
>> +     seqlock_t               seqlock;   /* Protects the stateid */
>> +     nfs4_stateid            stateid;
>>       unsigned long           state;
>>       struct inode            *inode;
>>  };
>>
>>  /* Layout driver I/O operations. */
>>  struct layoutdriver_io_operations {
>> +     struct pnfs_layout_segment * (*alloc_lseg) (struct pnfs_layout_hdr *layoutid, struct nfs4_layoutget_res *lgr);
>> +     void (*free_lseg) (struct pnfs_layout_segment *lseg);
>> +
>>       /* Registration information for a new mounted file system */
>>       int (*initialize_mountpoint) (struct nfs_client *);
>>       int (*uninitialize_mountpoint) (struct nfs_client *);
>>  };
>>
>> +struct pnfs_deviceid {
>> +     char data[NFS4_PNFS_DEVICEID4_SIZE];
>> +};
>> +
>> +struct pnfs_device {
>> +     struct pnfs_deviceid dev_id;
>> +     unsigned int  layout_type;
>> +     unsigned int  mincount;
>> +     struct page **pages;
>> +     void          *area;
>> +     unsigned int  pgbase;
>> +     unsigned int  pglen;
>> +     unsigned int  dev_notify_types;
>> +};
>> +
>> +/*
>> + * Device ID RCU cache. A device ID is unique per client ID and layout type.
>> + */
>> +#define NFS4_DEVICE_ID_HASH_BITS     5
>> +#define NFS4_DEVICE_ID_HASH_SIZE     (1 << NFS4_DEVICE_ID_HASH_BITS)
>> +#define NFS4_DEVICE_ID_HASH_MASK     (NFS4_DEVICE_ID_HASH_SIZE - 1)
>> +
>> +static inline u32
>> +nfs4_deviceid_hash(struct pnfs_deviceid *id)
>> +{
>> +     unsigned char *cptr = (unsigned char *)id->data;
>> +     unsigned int nbytes = NFS4_PNFS_DEVICEID4_SIZE;
>> +     u32 x = 0;
>> +
>> +     while (nbytes--) {
>> +             x *= 37;
>> +             x += *cptr++;
>> +     }
>> +     return x & NFS4_DEVICE_ID_HASH_MASK;
>> +}
>> +
>> +struct nfs4_deviceid_cache {
>> +     spinlock_t              dc_lock;
>> +     atomic_t                dc_ref;
>> +     void                    (*dc_free_callback)(struct nfs4_deviceid *);
>> +     struct hlist_head       dc_deviceids[NFS4_DEVICE_ID_HASH_SIZE];
>> +     struct hlist_head       dc_to_free;
>> +};
>> +
>> +/* Device ID cache node */
>> +struct nfs4_deviceid {
>> +     struct hlist_node       de_node;
>> +     struct pnfs_deviceid    de_id;
>> +     atomic_t                de_ref;
>> +};
>> +
>> +extern int nfs4_alloc_init_deviceid_cache(struct nfs_client *,
>> +                             void (*free_callback)(struct nfs4_deviceid *));
>> +extern void nfs4_put_deviceid_cache(struct nfs_client *);
>> +extern void nfs4_init_deviceid_node(struct nfs4_deviceid *);
>> +extern struct nfs4_deviceid *nfs4_find_get_deviceid(
>> +                             struct nfs4_deviceid_cache *,
>> +                             struct pnfs_deviceid *);
>> +extern struct nfs4_deviceid *nfs4_add_deviceid(struct nfs4_deviceid_cache *,
>> +                             struct nfs4_deviceid *);
>> +extern void nfs4_set_layout_deviceid(struct pnfs_layout_segment *,
>> +                             struct nfs4_deviceid *);
>> +extern void nfs4_put_layout_deviceid(struct pnfs_layout_segment *);
>> +
>>  extern int pnfs_register_layoutdriver(struct pnfs_layoutdriver_type *);
>>  extern void pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *);
>>
>> @@ -58,13 +130,30 @@ PNFS_NFS_SERVER(struct pnfs_layout_hdr *lo)
>>       return NFS_SERVER(lo->inode);
>>  }
>>
>> +static inline struct layoutdriver_io_operations *
>> +PNFS_LD_IO_OPS(struct pnfs_layout_hdr *lo)
>> +{
>> +     return PNFS_NFS_SERVER(lo)->pnfs_curr_ld->ld_io_ops;
>> +}
>> +
>> +/* nfs4proc.c */
>> +extern int nfs4_proc_getdeviceinfo(struct nfs_server *server,
>> +                                struct pnfs_device *dev);
>> +extern int nfs4_proc_layoutget(struct nfs4_layoutget *lgp);
>> +
>> +/* pnfs.c */
>>  struct pnfs_layout_segment *
>>  pnfs_update_layout(struct inode *ino, struct nfs_open_context *ctx,
>>                  enum pnfs_iomode access_type);
>>  void set_pnfs_layoutdriver(struct nfs_server *, u32 id);
>>  void unset_pnfs_layoutdriver(struct nfs_server *);
>> +int pnfs_layout_process(struct nfs4_layoutget *lgp);
>> +void pnfs_set_layout_stateid(struct pnfs_layout_hdr *lo,
>> +                          const nfs4_stateid *stateid);
>>  void pnfs_destroy_layout(struct nfs_inode *);
>>  void pnfs_destroy_all_layouts(struct nfs_client *);
>> +void put_layout_hdr(struct inode *inode);
>> +void pnfs_get_layout_stateid(nfs4_stateid *dst, struct pnfs_layout_hdr *lo);
>>
>>
>>  static inline int lo_fail_bit(u32 iomode)
>> diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
>> index 2dde7c8..dcdd11c 100644
>> --- a/include/linux/nfs4.h
>> +++ b/include/linux/nfs4.h
>> @@ -545,6 +545,8 @@ enum {
>>       NFSPROC4_CLNT_SEQUENCE,
>>       NFSPROC4_CLNT_GET_LEASE_TIME,
>>       NFSPROC4_CLNT_RECLAIM_COMPLETE,
>> +     NFSPROC4_CLNT_LAYOUTGET,
>> +     NFSPROC4_CLNT_GETDEVICEINFO,
>>  };
>>
>>  /* nfs41 types */
>> diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
>> index e670a9c..7512886 100644
>> --- a/include/linux/nfs_fs_sb.h
>> +++ b/include/linux/nfs_fs_sb.h
>> @@ -83,6 +83,7 @@ struct nfs_client {
>>       u32                     cl_exchange_flags;
>>       struct nfs4_session     *cl_session;    /* sharred session */
>>       struct list_head        cl_layouts;
>> +     struct nfs4_deviceid_cache *cl_devid_cache; /* pNFS deviceid cache */
>>  #endif /* CONFIG_NFS_V4_1 */
>>
>>  #ifdef CONFIG_NFS_FSCACHE
>> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
>> index 8a2c228..c4c6a61 100644
>> --- a/include/linux/nfs_xdr.h
>> +++ b/include/linux/nfs_xdr.h
>> @@ -186,6 +186,55 @@ struct nfs4_get_lease_time_res {
>>       struct nfs4_sequence_res        lr_seq_res;
>>  };
>>
>> +#define PNFS_LAYOUT_MAXSIZE 4096
>> +
>> +struct nfs4_layoutdriver_data {
>> +     __u32 len;
>> +     void *buf;
>> +};
>> +
>> +struct pnfs_layout_range {
>> +     u32 iomode;
>> +     u64 offset;
>> +     u64 length;
>> +};
>> +
>> +struct nfs4_layoutget_args {
>> +     __u32 type;
>> +     struct pnfs_layout_range range;
>> +     __u64 minlength;
>> +     __u32 maxcount;
>> +     struct inode *inode;
>> +     struct nfs_open_context *ctx;
>> +     struct nfs4_sequence_args seq_args;
>> +};
>> +
>> +struct nfs4_layoutget_res {
>> +     __u32 return_on_close;
>> +     struct pnfs_layout_range range;
>> +     __u32 type;
>> +     nfs4_stateid stateid;
>> +     struct nfs4_layoutdriver_data layout;
>> +     struct nfs4_sequence_res seq_res;
>> +};
>> +
>> +struct nfs4_layoutget {
>> +     struct nfs4_layoutget_args args;
>> +     struct nfs4_layoutget_res res;
>> +     struct pnfs_layout_segment **lsegpp;
>> +     int status;
>> +};
>> +
>> +struct nfs4_getdeviceinfo_args {
>> +     struct pnfs_device *pdev;
>> +     struct nfs4_sequence_args seq_args;
>> +};
>> +
>> +struct nfs4_getdeviceinfo_res {
>> +     struct pnfs_device *pdev;
>> +     struct nfs4_sequence_res seq_res;
>> +};
>> +
>>  /*
>>   * Arguments to the open call.
>>   */
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
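
The iomode matching table and the sort order used by the layout cache in
the quoted patch can be checked in isolation. The following is a minimal
userspace sketch, not the kernel code: the demo_* names are hypothetical,
and the numeric iomode values are assumed to follow layoutiomode4 in
RFC 5661 (READ = 1, RW = 2, ANY = 3).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror layoutiomode4 from RFC 5661. */
enum demo_iomode {
	DEMO_IOMODE_READ = 1,
	DEMO_IOMODE_RW   = 2,
	DEMO_IOMODE_ANY  = 3,
};

/* Same expression as is_matching_lseg(): only an RW request is picky. */
int demo_is_matching(uint32_t lseg_iomode, uint32_t req_iomode)
{
	return req_iomode != DEMO_IOMODE_RW || lseg_iomode == DEMO_IOMODE_RW;
}

/* Same expression as cmp_layout(): READ sorts after RW. */
int demo_cmp_layout(uint32_t iomode1, uint32_t iomode2)
{
	return (int)(iomode2 == DEMO_IOMODE_READ) -
	       (int)(iomode1 == DEMO_IOMODE_READ);
}

/* The six rows of the matching table from the patch comment, as assertions. */
void demo_check_table(void)
{
	assert(demo_is_matching(DEMO_IOMODE_READ, DEMO_IOMODE_ANY));
	assert(demo_is_matching(DEMO_IOMODE_RW,   DEMO_IOMODE_ANY));
	assert(!demo_is_matching(DEMO_IOMODE_READ, DEMO_IOMODE_RW));
	assert(demo_is_matching(DEMO_IOMODE_RW,   DEMO_IOMODE_RW));
	assert(demo_is_matching(DEMO_IOMODE_READ, DEMO_IOMODE_READ));
	assert(demo_is_matching(DEMO_IOMODE_RW,   DEMO_IOMODE_READ));
}
```

Because READ compares greater than RW, the insertion loop skips RW
entries when adding a READ segment, so a list walk sees RW segments
first.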

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/13] RFC: pnfs: filelayout: introduce minimal file layout driver
  2010-09-10 21:11     ` Fred Isaman
@ 2010-09-10 22:37       ` Trond Myklebust
  2010-09-13 10:32         ` Benny Halevy
  2010-09-13 14:48         ` Christoph Hellwig
  2010-09-13 10:16       ` Benny Halevy
  1 sibling, 2 replies; 55+ messages in thread
From: Trond Myklebust @ 2010-09-10 22:37 UTC (permalink / raw)
  To: Fred Isaman; +Cc: linux-nfs

On Fri, 2010-09-10 at 14:11 -0700, Fred Isaman wrote:
> On Fri, Sep 10, 2010 at 12:31 PM, Trond Myklebust
> <Trond.Myklebust@netapp.com> wrote:
> > On Thu, 2010-09-02 at 14:00 -0400, Fred Isaman wrote:
> OK
> 
> >> +     .initialize_mountpoint   = filelayout_initialize_mountpoint,
> >> +     .uninitialize_mountpoint = filelayout_uninitialize_mountpoint,
> >> +};
> >> +
> >> +
> >> +struct pnfs_layoutdriver_type filelayout_type = {
> >
> > Ditto.
> 
> This includes a list_head field which is set by the generic layer.
> 
> >
> >> +     .id = LAYOUT_NFSV4_1_FILES,
> >> +     .name = "LAYOUT_NFSV4_1_FILES",
> >> +     .ld_io_ops = &filelayout_io_operations,
> >
> > Why do we need a separate 'struct layoutdriver_io_operations'? Any
> > reason those can't just be embedded in struct pnfs_layoutdriver_type?
> 
> I believe this decision was primarily aesthetics.  However, keeping
> the static io_ops separate from the variable list_head seems like a
> good idea.

I dunno. They are in a 1-1 correspondence, so I'm not sure I see the
need for a separation.

> Perhaps having a driver structure that includes the io_ops and static
> portions of pnfs_layoutdriver_type, with the generic layer allocating
> a wrapper structure that is basically:
> struct {
>     struct list_head list;
>     struct pnfs_layoutdriver_type *driver_info;

      Should be const...

      struct module *owner = THIS_MODULE;

> }

...although the struct module could probably indeed be part of
pnfs_layoutdriver_type too.
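
The wrapper idea discussed above could look roughly like the userspace
sketch below. Everything here is an assumption made for illustration:
the demo_* names are hypothetical, a plain singly linked list stands in
for struct list_head, and locking and module reference counting are
omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Static, driver-provided description; never written by the generic layer. */
struct demo_layoutdriver_type {
	uint32_t id;
	const char *name;
};

/* Wrapper allocated by the generic layer; holds the mutable list linkage. */
struct demo_driver_entry {
	struct demo_driver_entry *next;	/* stands in for struct list_head */
	const struct demo_layoutdriver_type *driver_info;
};

static struct demo_driver_entry *demo_drivers;

/* Look up a registered driver by layout type id; NULL if not found. */
const struct demo_layoutdriver_type *demo_find_driver(uint32_t id)
{
	const struct demo_driver_entry *e;

	for (e = demo_drivers; e != NULL; e = e->next)
		if (e->driver_info->id == id)
			return e->driver_info;
	return NULL;
}

/* Returns 0 on success, -1 if the id is already registered or on OOM. */
int demo_register_driver(const struct demo_layoutdriver_type *ld)
{
	struct demo_driver_entry *e;

	assert(ld != NULL);
	if (demo_find_driver(ld->id) != NULL)
		return -1;
	e = malloc(sizeof(*e));
	if (e == NULL)
		return -1;
	e->driver_info = ld;
	e->next = demo_drivers;
	demo_drivers = e;
	return 0;
}
```

Keeping the linkage in the wrapper is what lets the driver's own
structure be declared const.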

Cheers
  Trond

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/13] RFC: pnfs: add LAYOUTGET and GETDEVICEINFO infrastructure
  2010-09-10 21:47     ` Fred Isaman
@ 2010-09-10 22:43       ` Trond Myklebust
  2010-09-13 14:16       ` Benny Halevy
  1 sibling, 0 replies; 55+ messages in thread
From: Trond Myklebust @ 2010-09-10 22:43 UTC (permalink / raw)
  To: Fred Isaman; +Cc: linux-nfs

On Fri, 2010-09-10 at 14:47 -0700, Fred Isaman wrote:
> On Fri, Sep 10, 2010 at 1:11 PM, Trond Myklebust
> <Trond.Myklebust@netapp.com> wrote:
> > On Thu, 2010-09-02 at 14:00 -0400, Fred Isaman wrote:
> >> From: The pNFS Team <linux-nfs@vger.kernel.org>
> >>  static void
> >>  init_lseg(struct pnfs_layout_hdr *lo, struct pnfs_layout_segment *lseg)
> >>  {
> >> @@ -191,7 +210,7 @@ destroy_lseg(struct kref *kref)
> >>       struct pnfs_layout_hdr *local = lseg->layout;
> >>
> >>       dprintk("--> %s\n", __func__);
> >> -     kfree(lseg);
> >> +     PNFS_LD_IO_OPS(local)->free_lseg(lseg);
> >
> > Where is PNFS_LD_IO_OPS() defined? Besides, I thought we agreed to get
> > rid of that.
> 
> This is defined in pnfs.h as
> PNFS_NFS_SERVER()->pnfs_curr_ld->ld_io_ops, mainly to save typing.

It may save typing in the short term, but long-term it mainly serves to
hide the various levels of indirection. I'd prefer to have the latter
obvious to people in order to encourage them to think more carefully
about how to avoid recalculating these values over and over again.
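
As a rough userspace illustration of that point (the demo_* structures
are hypothetical stand-ins, not the kernel definitions), spelling out the
chain once and keeping the result in a local makes both the depth of the
indirection and its cost visible:

```c
#include <assert.h>
#include <stddef.h>

struct demo_lseg { int freed; };

struct demo_io_ops { void (*free_lseg)(struct demo_lseg *); };
struct demo_ld { const struct demo_io_ops *ld_io_ops; };
struct demo_server { const struct demo_ld *pnfs_curr_ld; };
struct demo_inode { struct demo_server *server; };
struct demo_layout_hdr { struct demo_inode *inode; };

/* Trivial free_lseg implementation that only records the call. */
void demo_free_lseg(struct demo_lseg *lseg)
{
	lseg->freed = 1;
}

/*
 * Release n segments. The four-level pointer chain is walked exactly
 * once; every iteration then reuses the cached ops pointer.
 */
void demo_free_all(struct demo_layout_hdr *lo, struct demo_lseg *segs,
		   size_t n)
{
	const struct demo_io_ops *ops =
		lo->inode->server->pnfs_curr_ld->ld_io_ops;
	size_t i;

	assert(ops != NULL && ops->free_lseg != NULL);
	for (i = 0; i < n; i++)
		ops->free_lseg(&segs[i]);
}
```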

Cheers
  Trond

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/13] RFC: pnfs: filelayout: introduce minimal file layout driver
  2010-09-10 19:31   ` Trond Myklebust
  2010-09-10 21:11     ` Fred Isaman
@ 2010-09-10 23:56     ` Christoph Hellwig
  2010-09-11  0:03       ` Trond Myklebust
  1 sibling, 1 reply; 55+ messages in thread
From: Christoph Hellwig @ 2010-09-10 23:56 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Fred Isaman, linux-nfs

On Fri, Sep 10, 2010 at 03:31:51PM -0400, Trond Myklebust wrote:
> > +	tristate
> > +	depends on NFS_FS && NFS_V4_1
> > +	default m
> 
> Should be 'default y', otherwise it has an implicit dependency on
> CONFIG_MODULES.

No, it should not have a default statement at all.  The only reason to
put in a default statement is to keep existing code working when it's
split into multiple options, which this is not.  This is not just my
opinion, btw - Linus has frequently whacked people for introducing pointless
defaults in the past.

And even if it was okay sometimes, pnfs is nowhere near important enough
to add it in Kconfig.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/13] RFC: pnfs: full mount/umount infrastructure
  2010-09-02 18:00 ` [PATCH 07/13] RFC: pnfs: full mount/umount infrastructure Fred Isaman
  2010-09-10 19:23   ` Trond Myklebust
@ 2010-09-10 23:58   ` Christoph Hellwig
  2010-09-11  0:07     ` Trond Myklebust
  2010-09-13 15:07   ` Christoph Hellwig
  2 siblings, 1 reply; 55+ messages in thread
From: Christoph Hellwig @ 2010-09-10 23:58 UTC (permalink / raw)
  To: Fred Isaman; +Cc: linux-nfs

> +EXPORT_SYMBOL(pnfs_register_layoutdriver);

All exports from nfs.ko need to be _GPL - this is in no way a public
API, just an internal subdivision of the nfs client.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/13] RFC: pnfs: filelayout: introduce minimal file layout driver
  2010-09-10 23:56     ` Christoph Hellwig
@ 2010-09-11  0:03       ` Trond Myklebust
  2010-09-11  0:07         ` Christoph Hellwig
  0 siblings, 1 reply; 55+ messages in thread
From: Trond Myklebust @ 2010-09-11  0:03 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Fred Isaman, linux-nfs

On Fri, 2010-09-10 at 19:56 -0400, Christoph Hellwig wrote:
> On Fri, Sep 10, 2010 at 03:31:51PM -0400, Trond Myklebust wrote:
> > > +	tristate
> > > +	depends on NFS_FS && NFS_V4_1
> > > +	default m
> > 
> > Should be 'default y', otherwise it has an implicit dependency on
> > CONFIG_MODULES.
> 
> No, it should not have a default statement at all.  The only reason to
> put in a default statement is to keep existing code working when it's
> split into multiple options, which this is not.  This is not just my
> opinion, btw - Linus has frequently whacked people for introducing pointless
> defaults in the past.
> 
> And even if it was okay sometimes, pnfs is nowhere near important enough
> to add it in Kconfig.

So you are saying we should simply equate CONFIG_PNFS_FILE_LAYOUT and
CONFIG_NFS_V4_1 right now? Yep, I'd be fine with that... I'm still
working on the patches to get rid of all these CONFIG options, but
ultimately this is what I'm working towards.

Cheers
  Trond

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/13] RFC: pnfs: filelayout: introduce minimal file layout driver
  2010-09-11  0:03       ` Trond Myklebust
@ 2010-09-11  0:07         ` Christoph Hellwig
  2010-09-11  0:13           ` Trond Myklebust
  0 siblings, 1 reply; 55+ messages in thread
From: Christoph Hellwig @ 2010-09-11  0:07 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Christoph Hellwig, Fred Isaman, linux-nfs

On Fri, Sep 10, 2010 at 08:03:53PM -0400, Trond Myklebust wrote:
> So you are saying we should simply equate CONFIG_PNFS_FILE_LAYOUT and
> CONFIG_NFS_V4_1 right now? Yep, I'd be fine with that... I'm still
> working on the patches to get rid of all these CONFIG options, but
> ultimately this is what I'm working towards.

If you don't want a separate option that's up to you, but I don't think
forcing people to build the pnfs file layout just because they want nfs4.1
features is all that smart an idea.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/13] RFC: pnfs: full mount/umount infrastructure
  2010-09-10 23:58   ` Christoph Hellwig
@ 2010-09-11  0:07     ` Trond Myklebust
  2010-09-13 11:24       ` Benny Halevy
  0 siblings, 1 reply; 55+ messages in thread
From: Trond Myklebust @ 2010-09-11  0:07 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Fred Isaman, linux-nfs

On Fri, 2010-09-10 at 19:58 -0400, Christoph Hellwig wrote:
> > +EXPORT_SYMBOL(pnfs_register_layoutdriver);
> 
> All exports from nfs.ko need to be _GPL - this is in no way a public
> API, just an internal subdivision of the nfs client.

ACK. We're not committing to supporting a stable ABI here in any way,
shape or form...

Cheers
  Trond


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/13] RFC: pnfs: filelayout: introduce minimal file layout driver
  2010-09-11  0:07         ` Christoph Hellwig
@ 2010-09-11  0:13           ` Trond Myklebust
  2010-09-13 11:28             ` Benny Halevy
  0 siblings, 1 reply; 55+ messages in thread
From: Trond Myklebust @ 2010-09-11  0:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Fred Isaman, linux-nfs

On Fri, 2010-09-10 at 20:07 -0400, Christoph Hellwig wrote:
> On Fri, Sep 10, 2010 at 08:03:53PM -0400, Trond Myklebust wrote:
> > So you are saying we should simply equate CONFIG_PNFS_FILE_LAYOUT and
> > CONFIG_NFS_V4_1 right now? Yep, I'd be fine with that... I'm still
> > working on the patches to get rid of all these CONFIG options, but
> > ultimately this is what I'm working towards.
> 
> If you don't want a separate option that's up to you, but I don't think
> forcing people to build the pnfs file layout just because they want nfs4.1
> features is all that smart an idea.

IMHO it should be fine.

Most people will be compiling NFS as a module, in which case, the pnfs
file layout is just another module that can be left out from the final
binary if people don't want it.

I'm still waiting to hear from people who want to compile NFSv4.1 in the
main kernel, but who want pNFS to be modularised.

Cheers
  Trond

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/13] RFC: pnfs: filelayout: introduce minimal file layout driver
  2010-09-10 21:11     ` Fred Isaman
  2010-09-10 22:37       ` Trond Myklebust
@ 2010-09-13 10:16       ` Benny Halevy
  1 sibling, 0 replies; 55+ messages in thread
From: Benny Halevy @ 2010-09-13 10:16 UTC (permalink / raw)
  To: Fred Isaman; +Cc: Trond Myklebust, linux-nfs

On 2010-09-11 00:11, Fred Isaman wrote:
> On Fri, Sep 10, 2010 at 12:31 PM, Trond Myklebust
> <Trond.Myklebust@netapp.com> wrote:
>> On Thu, 2010-09-02 at 14:00 -0400, Fred Isaman wrote:
>>> From: The pNFS Team <linux-nfs@vger.kernel.org>
>>>
>>> This driver just registers itself and supplies trivial mount/umount functions.
>>>
>>> Signed-off-by: TBD - melding/reorganization of several patches
>>> ---
>>>  fs/nfs/Kconfig          |    5 +++
>>>  fs/nfs/Makefile         |    3 ++
>>>  fs/nfs/nfs4filelayout.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++
>>>  include/linux/nfs_fs.h  |    1 +
>>>  4 files changed, 98 insertions(+), 0 deletions(-)
>>>  create mode 100644 fs/nfs/nfs4filelayout.c
>>>
>>> diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig
>>> index 5f1b936..980f2dc 100644
>>> --- a/fs/nfs/Kconfig
>>> +++ b/fs/nfs/Kconfig
>>> @@ -82,6 +82,11 @@ config NFS_V4_1
>>>
>>>         If unsure, say N.
>>>
>>> +config PNFS_FILE_LAYOUT
>>> +     tristate
>>> +     depends on NFS_FS && NFS_V4_1
>>> +     default m
>>
>> Should be 'default y', otherwise it has an implicit dependency on
>> CONFIG_MODULES.
>>
> 
> The idea was that normally the driver would compile as a module, and
> use loading/unloading of it to control whether pnfs is supported.
> 
> Is there a way to do this that does not introduce the implicit dependency?
> 

The explicit dependency on NFS_FS does the trick for you (as it is currently in the
pnfs tree), so the default is set to m iff NFS_FS==m

Benny

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/13] RFC: pnfs: filelayout: introduce minimal file layout driver
  2010-09-10 22:37       ` Trond Myklebust
@ 2010-09-13 10:32         ` Benny Halevy
  2010-09-13 13:01           ` Fred Isaman
  2010-09-13 14:48         ` Christoph Hellwig
  1 sibling, 1 reply; 55+ messages in thread
From: Benny Halevy @ 2010-09-13 10:32 UTC (permalink / raw)
  To: Trond Myklebust, Fred Isaman; +Cc: linux-nfs

On 2010-09-11 01:37, Trond Myklebust wrote:
> On Fri, 2010-09-10 at 14:11 -0700, Fred Isaman wrote:
>> On Fri, Sep 10, 2010 at 12:31 PM, Trond Myklebust
>> <Trond.Myklebust@netapp.com> wrote:
>>> On Thu, 2010-09-02 at 14:00 -0400, Fred Isaman wrote:
>> OK
>>
>>>> +     .initialize_mountpoint   = filelayout_initialize_mountpoint,
>>>> +     .uninitialize_mountpoint = filelayout_uninitialize_mountpoint,
>>>> +};
>>>> +
>>>> +
>>>> +struct pnfs_layoutdriver_type filelayout_type = {
>>>
>>> Ditto.
>>
>> This includes a list_head field which is set by the generic layer.
>>
>>>
>>>> +     .id = LAYOUT_NFSV4_1_FILES,
>>>> +     .name = "LAYOUT_NFSV4_1_FILES",
>>>> +     .ld_io_ops = &filelayout_io_operations,
>>>
>>> Why do we need a separate 'struct layoutdriver_io_operations'? Any
>>> reason those can't just be embedded in struct pnfs_layoutdriver_type?
>>
>> I believe this decision was primarily aesthetics.  However, keeping
>> the static io_ops separate from the variable list_head seems like a
>> good idea.
> 
> I dunno. They are in a 1-1 correspondence, so I'm not sure I see the
> need for a separation.
> 

Later in the game we introduce the layout driver policy ops.
That said, they could be added to the same vector as the io ops.

>> Perhaps having a driver structure that includes the io_ops and static
>> portions of pnfs_layoutdriver_type, with the generic layer allocating
>> a wrapper structure that is basically:
>> struct {
>>     struct list_head list;
>>     struct pnfs_layoutdriver_type *driver_info;
> 
>       Should be const...
> 
>       struct module *owner = THIS_MODULE;
> 
>> }
> 
> ...although the struct module could probably indeed be part of
> pnfs_layoutdriver_type too.

Agreed.
I think we should just have

struct pnfs_layoutdriver_type {
	struct list_head pnfs_tblid;
	const u32 id;
	const char *name;
	const struct module *owner;
	const struct layoutdriver_operations *ld_ops;
 };

Benny
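
To make the merged layout above concrete, here is a minimal userspace
sketch (not kernel code) of a driver type carrying a single operations
vector, plus a registry that rejects duplicate ids.  Assumptions: a
plain `next` pointer stands in for struct list_head, the `owner` field
and all locking are left out, and the names ld_find, ld_register and
ld_unregister are hypothetical.  Like the posted
pnfs_register_layoutdriver(), it returns -EINVAL both for a missing ops
vector and for an id that is already registered.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct nfs_client;	/* opaque here; only ever passed through */

/* One ops vector: I/O hooks now, policy hooks could be added later. */
struct layoutdriver_operations {
	int (*initialize_mountpoint)(struct nfs_client *);
	int (*uninitialize_mountpoint)(struct nfs_client *);
};

struct pnfs_layoutdriver_type {
	struct pnfs_layoutdriver_type *next;	/* stand-in for list_head */
	uint32_t id;
	const char *name;
	const struct layoutdriver_operations *ld_ops;
};

/* Head of the list of registered drivers (no locking in this sketch). */
static struct pnfs_layoutdriver_type *ld_table;

/* Return the registered driver matching id, or NULL if there is none. */
struct pnfs_layoutdriver_type *ld_find(uint32_t id)
{
	struct pnfs_layoutdriver_type *ld;

	for (ld = ld_table; ld; ld = ld->next)
		if (ld->id == id)
			return ld;
	return NULL;
}

/* Register a driver; fails if it has no ops or its id is already taken. */
int ld_register(struct pnfs_layoutdriver_type *ld)
{
	if (!ld || !ld->ld_ops)
		return -EINVAL;
	if (ld_find(ld->id))
		return -EINVAL;
	ld->next = ld_table;
	ld_table = ld;
	return 0;
}

/* Unlink a driver from the table; a no-op if it was never registered. */
void ld_unregister(struct pnfs_layoutdriver_type *ld)
{
	struct pnfs_layoutdriver_type **pp;

	assert(ld != NULL);
	for (pp = &ld_table; *pp; pp = &(*pp)->next) {
		if (*pp == ld) {
			*pp = ld->next;
			ld->next = NULL;
			return;
		}
	}
}
```

With one vector, a driver fills in a single const ops table and the
generic layer only ever touches the link field of the driver type.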

> 
> Cheers
>   Trond
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/13] RFC: pnfs: full mount/umount infrastructure
  2010-09-10 19:23   ` Trond Myklebust
       [not found]     ` <1284146604.10062.68.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2010-09-13 11:06     ` Boaz Harrosh
  2010-09-13 14:44       ` Christoph Hellwig
  2010-09-13 11:20     ` Benny Halevy
  2 siblings, 1 reply; 55+ messages in thread
From: Boaz Harrosh @ 2010-09-13 11:06 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Fred Isaman, linux-nfs

On 09/10/2010 10:23 PM, Trond Myklebust wrote:
> On Thu, 2010-09-02 at 14:00 -0400, Fred Isaman wrote:
>>  
>>  /* Unitialize a mountpoint in a layout driver */
>>  void
>>  unset_pnfs_layoutdriver(struct nfs_server *nfss)
>>  {
>> +	if (nfss->pnfs_curr_ld)
>> +		nfss->pnfs_curr_ld->ld_io_ops->uninitialize_mountpoint(nfss->nfs_client);
> 
> That 'uninitialize_mountpoint' name doesn't make any sense. The
> nfs_client parameter isn't associated to a particular mountpoint.
> 

Me too, BTW.
I will later need per-super-block resources, mainly slabs and a work_q.
The above is used for two things:
One - Per-client setup, but only after the layout type is known, which
      happens after the first mount. (For the device cache.)
Two - Per-server (per-mount) setup, which will be used by objects later on.
      (I think there is an if somewhere that makes this call only
       once per client, right?)

I think we should just pass nfss to the LD and let it take care of accessing
nfss->nfs_client for the device cache. (It is done so later in the tree, right?)

Boaz
>>  	nfss->pnfs_curr_ld = NULL;
>>  }
>>  
>> @@ -68,6 +100,12 @@ set_pnfs_layoutdriver(struct nfs_server *server, u32 id)
>>  			goto out_no_driver;
>>  		}
>>  	}
>> +	if (ld_type->ld_io_ops->initialize_mountpoint(server->nfs_client)) {
> 
> Ditto.
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/13] RFC: pnfs: full mount/umount infrastructure
  2010-09-10 19:23   ` Trond Myklebust
       [not found]     ` <1284146604.10062.68.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  2010-09-13 11:06     ` Boaz Harrosh
@ 2010-09-13 11:20     ` Benny Halevy
  2 siblings, 0 replies; 55+ messages in thread
From: Benny Halevy @ 2010-09-13 11:20 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Fred Isaman, linux-nfs

On 2010-09-10 22:23, Trond Myklebust wrote:
> On Thu, 2010-09-02 at 14:00 -0400, Fred Isaman wrote:
>> From: The pNFS Team <linux-nfs@vger.kernel.org>
>>
>> Allow a module implementing a layout type to register, and
>> have its mount/umount routines called for filesystems that
>> the server declares support it.
>>
>> Signed-off-by: TBD - melding/reorganization of several patches
>> ---
>>  Documentation/filesystems/nfs/00-INDEX |    2 +
>>  Documentation/filesystems/nfs/pnfs.txt |   48 +++++++++++++++++++
>>  fs/nfs/Kconfig                         |    2 +-
>>  fs/nfs/pnfs.c                          |   79 +++++++++++++++++++++++++++++++-
>>  fs/nfs/pnfs.h                          |   14 ++++++
>>  5 files changed, 142 insertions(+), 3 deletions(-)
>>  create mode 100644 Documentation/filesystems/nfs/pnfs.txt
>>
>> diff --git a/Documentation/filesystems/nfs/00-INDEX b/Documentation/filesystems/nfs/00-INDEX
>> index 2f68cd6..8d930b9 100644
>> --- a/Documentation/filesystems/nfs/00-INDEX
>> +++ b/Documentation/filesystems/nfs/00-INDEX
>> @@ -12,5 +12,7 @@ nfs-rdma.txt
>>  	- how to install and setup the Linux NFS/RDMA client and server software
>>  nfsroot.txt
>>  	- short guide on setting up a diskless box with NFS root filesystem.
>> +pnfs.txt
>> +	- short explanation of some of the internals of the pnfs code
>>  rpc-cache.txt
>>  	- introduction to the caching mechanisms in the sunrpc layer.
>> diff --git a/Documentation/filesystems/nfs/pnfs.txt b/Documentation/filesystems/nfs/pnfs.txt
>> new file mode 100644
>> index 0000000..bc0b9cf
>> --- /dev/null
>> +++ b/Documentation/filesystems/nfs/pnfs.txt
>> @@ -0,0 +1,48 @@
>> +Reference counting in pnfs:
>> +==========================
>> +
>> +There are several inter-related caches.  We have layouts which can
>> +reference multiple devices, each of which can reference multiple data servers.
>> +Each data server can be referenced by multiple devices.  Each device
>> +can be referenced by multiple layouts.  To keep all of this straight,
>> +we need to reference count.
>> +
>> +
>> +struct pnfs_layout_hdr
>> +----------------------
>> +The on-the-wire command LAYOUTGET corresponds to struct
>> +pnfs_layout_segment, usually referred to by the variable name lseg.
>> +Each nfs_inode may hold a pointer to a cache of these layout
>> +segments in nfsi->layout, of type struct pnfs_layout_hdr.
>> +
>> +We reference the header for the inode pointing to it, across each
>> +outstanding RPC call that references it (LAYOUTGET, LAYOUTRETURN,
>> +LAYOUTCOMMIT), and for each lseg held within.
>> +
>> +Each header is also (when non-empty) put on a list associated with
>> +struct nfs_client (cl_layouts).  Being put on this list does not bump
>> +the reference count, as the layout is kept around by the lseg that
>> +keeps it in the list.
>> +
>> +deviceid_cache
>> +--------------
>> +lsegs reference device ids, which are resolved per nfs_client and
>> +layout driver type.  The device ids are held in a RCU cache (struct
>> +nfs4_deviceid_cache).  The cache itself is referenced across each
>> +mount.  The entries (struct nfs4_deviceid) themselves are held across
>> +the lifetime of each lseg referencing them.
>> +
>> +RCU is used because the deviceid is basically a write once, read many
>> +data structure.  The hlist size of 32 buckets needs better
>> +justification, but seems reasonable given that we can have multiple
>> +deviceid's per filesystem, and multiple filesystems per nfs_client.
>> +
>> +The hash code is copied from the nfsd code base.  A discussion of
>> +hashing and variations of this algorithm can be found at:
>> +http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809
>> +
>> +data server cache
>> +-----------------
>> +file driver devices refer to data servers, which are kept in a module
>> +level cache.  Its reference is held over the lifetime of the deviceid
>> +pointing to it.
>> diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig
>> index 6c2aad4..5f1b936 100644
>> --- a/fs/nfs/Kconfig
>> +++ b/fs/nfs/Kconfig
>> @@ -78,7 +78,7 @@ config NFS_V4_1
>>  	depends on NFS_V4 && EXPERIMENTAL
>>  	help
>>  	  This option enables support for minor version 1 of the NFSv4 protocol
>> -	  (draft-ietf-nfsv4-minorversion1) in the kernel's NFS client.
>> +	  (RFC 5661) in the kernel's NFS client.
>>  
>>  	  If unsure, say N.
>>  
>> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>> index 2e5dba1..8d503fc 100644
>> --- a/fs/nfs/pnfs.c
>> +++ b/fs/nfs/pnfs.c
>> @@ -32,16 +32,48 @@
>>  
>>  #define NFSDBG_FACILITY		NFSDBG_PNFS
>>  
>> -/* STUB that returns the equivalent of "no module found" */
>> +/* Locking:
>> + *
>> + * pnfs_spinlock:
>> + *      protects pnfs_modules_tbl.
>> + */
>> +static DEFINE_SPINLOCK(pnfs_spinlock);
>> +
>> +/*
>> + * pnfs_modules_tbl holds all pnfs modules
>> + */
>> +static LIST_HEAD(pnfs_modules_tbl);
>> +
>> +/* Return the registered pnfs layout driver module matching given id */
>> +static struct pnfs_layoutdriver_type *
>> +find_pnfs_driver_locked(u32 id) {
>> +	struct  pnfs_layoutdriver_type *local;
>> +
>> +	dprintk("PNFS: %s: Searching for %u\n", __func__, id);
>> +	list_for_each_entry(local, &pnfs_modules_tbl, pnfs_tblid)
>> +		if (local->id == id)
>> +			goto out;
>> +	local = NULL;
>> +out:
>> +	return local;
>> +}
>> +
>>  static struct pnfs_layoutdriver_type *
>>  find_pnfs_driver(u32 id) {
>> -	return NULL;
>> +	struct  pnfs_layoutdriver_type *local;
>> +
>> +	spin_lock(&pnfs_spinlock);
>> +	local = find_pnfs_driver_locked(id);
> 
> Don't you want some kind of reference count on this? I'd assume that you
> probably need a module_get() with a corresponding module_put() when you
> are done using the layoutdriver.
> 
>> +	spin_unlock(&pnfs_spinlock);
>> +	return local;
>>  }
>>  
>>  /* Unitialize a mountpoint in a layout driver */
>>  void
>>  unset_pnfs_layoutdriver(struct nfs_server *nfss)
>>  {
>> +	if (nfss->pnfs_curr_ld)
>> +		nfss->pnfs_curr_ld->ld_io_ops->uninitialize_mountpoint(nfss->nfs_client);
> 
> That 'uninitialize_mountpoint' name doesn't make any sense. The
> nfs_client parameter isn't associated to a particular mountpoint.
> 

We call these methods upon creating and destroying the nfs_server,
respectively. Later on, in the post-submit world, we change this parameter
to a struct nfs_server * for the blocks layout driver.
The motivation is to issue GETDEVICELIST at mount time.

For the file layout in its present state we only use the nfs_client
for the deviceid cache.  Note that to support multiple layout types
per server (possibly for different filesystems exported by that server),
we'll need a per-layouttype deviceid cache on the nfs_client.

We can have different methods for the per-nfs_client event
and the per-nfs_server event and call them correspondingly.

Benny
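
A rough userspace sketch of what separate per-client and per-server
hooks could look like.  Everything here is hypothetical: struct client
and struct server merely model nfs_client and nfs_server, the ld_ops
members and the ld_mount/ld_umount helpers are invented names, and
there is no locking.  The client hook runs only for the first mount
that uses a client and is undone when the last one goes away; the
server hook runs for every mount.

```c
#include <assert.h>
#include <stddef.h>

struct client {			/* models nfs_client */
	int nr_servers;		/* mounts currently using this client */
	int cache_ready;	/* models the deviceid cache being set up */
	int cache_inits;	/* how many times the cache was initialised */
};

struct server {			/* models nfs_server */
	struct client *clp;
	int mounted;
};

struct ld_ops {
	int  (*init_client)(struct client *);	/* e.g. deviceid cache */
	void (*exit_client)(struct client *);
	int  (*init_server)(struct server *);	/* e.g. GETDEVICELIST */
	void (*exit_server)(struct server *);
};

/* Called when a server (mount) is created. */
int ld_mount(const struct ld_ops *ops, struct server *srv)
{
	struct client *clp = srv->clp;
	int first = (clp->nr_servers == 0);
	int err;

	if (first) {			/* first mount on this client */
		err = ops->init_client(clp);
		if (err)
			return err;
	}
	err = ops->init_server(srv);
	if (err) {
		if (first)		/* undo the per-client setup */
			ops->exit_client(clp);
		return err;
	}
	clp->nr_servers++;
	return 0;
}

/* Called when a server (mount) is destroyed. */
void ld_umount(const struct ld_ops *ops, struct server *srv)
{
	struct client *clp = srv->clp;

	assert(clp->nr_servers > 0);
	ops->exit_server(srv);
	if (--clp->nr_servers == 0)	/* last mount on this client */
		ops->exit_client(clp);
}

/* Trivial demo hooks so the flow above can be exercised. */
static int demo_init_client(struct client *clp)
{
	clp->cache_inits++;
	clp->cache_ready = 1;
	return 0;
}

static void demo_exit_client(struct client *clp)
{
	clp->cache_ready = 0;
}

static int demo_init_server(struct server *srv)
{
	srv->mounted = 1;
	return 0;
}

static void demo_exit_server(struct server *srv)
{
	srv->mounted = 0;
}

const struct ld_ops demo_ops = {
	demo_init_client, demo_exit_client,
	demo_init_server, demo_exit_server,
};
```

Two mounts sharing one client would then initialise the cache once and
tear it down only after the second umount.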

>>  	nfss->pnfs_curr_ld = NULL;
>>  }
>>  
>> @@ -68,6 +100,12 @@ set_pnfs_layoutdriver(struct nfs_server *server, u32 id)
>>  			goto out_no_driver;
>>  		}
>>  	}
>> +	if (ld_type->ld_io_ops->initialize_mountpoint(server->nfs_client)) {
> 
> Ditto.
> 
>> +		printk(KERN_ERR
>> +		       "%s: Error initializing mount point for layout driver %u.\n",
>> +		       __func__, id);
>> +		goto out_no_driver;
>> +	}
>>  	server->pnfs_curr_ld = ld_type;
>>  	dprintk("%s: pNFS module for %u set\n", __func__, id);
>>  	return;
>> @@ -76,3 +114,40 @@ out_no_driver:
>>  	dprintk("%s: Using NFSv4 I/O\n", __func__);
>>  	server->pnfs_curr_ld = NULL;
>>  }
>> +
>> +int
>> +pnfs_register_layoutdriver(struct pnfs_layoutdriver_type *ld_type)
>> +{
>> +	struct layoutdriver_io_operations *io_ops = ld_type->ld_io_ops;
>> +	int status = -EINVAL;
>> +
>> +	if (!io_ops) {
>> +		printk(KERN_ERR "%s Layout driver must provide io_ops\n",
>> +			__func__);
>> +		return status;
>> +	}
>> +
>> +	spin_lock(&pnfs_spinlock);
>> +	if (!find_pnfs_driver_locked(ld_type->id)) {
>> +		list_add(&ld_type->pnfs_tblid, &pnfs_modules_tbl);
>> +		status = 0;
>> +		dprintk("%s Registering id:%u name:%s\n", __func__, ld_type->id,
>> +			ld_type->name);
>> +	} else
>> +		printk(KERN_ERR "%s Module with id %d already loaded!\n",
>> +			__func__, ld_type->id);
>> +	spin_unlock(&pnfs_spinlock);
>> +
>> +	return status;
>> +}
>> +EXPORT_SYMBOL(pnfs_register_layoutdriver);
>> +
>> +void
>> +pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *ld_type)
>> +{
>> +	dprintk("%s Deregistering id:%u\n", __func__, ld_type->id);
>> +	spin_lock(&pnfs_spinlock);
>> +	list_del(&ld_type->pnfs_tblid);
>> +	spin_unlock(&pnfs_spinlock);
>> +}
>> +EXPORT_SYMBOL(pnfs_unregister_layoutdriver);
>> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
>> index 3281fbf..9049b9a 100644
>> --- a/fs/nfs/pnfs.h
>> +++ b/fs/nfs/pnfs.h
>> @@ -16,8 +16,22 @@
>>  
>>  /* Per-layout driver specific registration structure */
>>  struct pnfs_layoutdriver_type {
>> +	struct list_head pnfs_tblid;
>> +	const u32 id;
>> +	const char *name;
>> +	struct layoutdriver_io_operations *ld_io_ops;
>>  };
>>  
>> +/* Layout driver I/O operations. */
>> +struct layoutdriver_io_operations {
>> +	/* Registration information for a new mounted file system */
>> +	int (*initialize_mountpoint) (struct nfs_client *);
>> +	int (*uninitialize_mountpoint) (struct nfs_client *);
>> +};
>> +
>> +extern int pnfs_register_layoutdriver(struct pnfs_layoutdriver_type *);
>> +extern void pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *);
>> +
>>  void set_pnfs_layoutdriver(struct nfs_server *, u32 id);
>>  void unset_pnfs_layoutdriver(struct nfs_server *);
>>  
> 
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/13] RFC: pnfs: full mount/umount infrastructure
  2010-09-11  0:07     ` Trond Myklebust
@ 2010-09-13 11:24       ` Benny Halevy
  2010-09-13 12:29         ` Trond Myklebust
  2010-09-13 14:28         ` Christoph Hellwig
  0 siblings, 2 replies; 55+ messages in thread
From: Benny Halevy @ 2010-09-13 11:24 UTC (permalink / raw)
  To: Trond Myklebust, Christoph Hellwig; +Cc: Fred Isaman, linux-nfs

On 2010-09-11 03:07, Trond Myklebust wrote:
> On Fri, 2010-09-10 at 19:58 -0400, Christoph Hellwig wrote:
>>> +EXPORT_SYMBOL(pnfs_register_layoutdriver);
>>
>> All exports from nfs.ko need to be _GPL - this is in no way a public
>> API, just an internal subdivision of the nfs client.
> 
> ACK. We're not committing to supporting a stable ABI here in any way,
> shape or form...

OK, not yet.

However, in the longer run I'd like us to consider formalizing
the kABI for non-GPLed layout drivers.

I think that this is a great selling point as it fully materializes the
extensibility of the layout-type / layout-driver design model.

Benny

> 
> Cheers
>   Trond
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/13] RFC: pnfs: filelayout: introduce minimal file layout driver
  2010-09-11  0:13           ` Trond Myklebust
@ 2010-09-13 11:28             ` Benny Halevy
  0 siblings, 0 replies; 55+ messages in thread
From: Benny Halevy @ 2010-09-13 11:28 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Christoph Hellwig, Fred Isaman, linux-nfs

On 2010-09-11 03:13, Trond Myklebust wrote:
> On Fri, 2010-09-10 at 20:07 -0400, Christoph Hellwig wrote:
>> On Fri, Sep 10, 2010 at 08:03:53PM -0400, Trond Myklebust wrote:
>>> So you are saying we should simply equate CONFIG_PNFS_FILE_LAYOUT and
>>> CONFIG_NFS_V4_1 right now? Yep, I'd be fine with that... I'm still
>>> working on the patches to get rid of all these CONFIG options, but
>>> ultimately this is what I'm working towards.
>>
>> If you don't want a separate option that's up to you, but I don't think
>> forcing people to build the pnfs file layout just because they want nfs4.1
>> features is all that smart an idea.
> 
> IMHO it should be fine.
> 
> Most people will be compiling NFS as a module, in which case, the pnfs
> file layout is just another module that can be left out from the final
> binary if people don't want it.
> 
> I'm still waiting to hear from people who want to compile NFSv4.1 in the
> main kernel, but who want pNFS to be modularised.

If a user uses a different layout driver and not the files layout driver,
then the latter is just extra baggage.  That said, the way Fred implemented
the Kconfig option, it cannot be disabled independently anyway, so it's rather
pointless.

Benny

> 
> Cheers
>   Trond

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 09/13] RFC: nfs: create and destroy inode's layout cache
  2010-09-10 19:43   ` Trond Myklebust
       [not found]     ` <1284147785.10062.80.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2010-09-13 11:32     ` Benny Halevy
  1 sibling, 0 replies; 55+ messages in thread
From: Benny Halevy @ 2010-09-13 11:32 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Fred Isaman, linux-nfs

On 2010-09-10 22:43, Trond Myklebust wrote:
> On Thu, 2010-09-02 at 14:00 -0400, Fred Isaman wrote:
>> From: The pNFS Team <linux-nfs@vger.kernel.org>
>>
>> At the start of the io paths, try to grab the relevant layout
>> information.  This will initiate the inode's layout cache, but
>> stubs ensure the cache stays empty.
>>
>> Signed-off-by: TBD - melding/reorganization of several patches
>> ---
>>  fs/nfs/file.c          |    5 ++
>>  fs/nfs/inode.c         |    3 +
>>  fs/nfs/pnfs.c          |  140 ++++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/nfs/pnfs.h          |   39 +++++++++++++
>>  fs/nfs/read.c          |    3 +
>>  include/linux/nfs_fs.h |    3 +
>>  6 files changed, 193 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
>> index eb51bd6..10ebdfb 100644
>> --- a/fs/nfs/file.c
>> +++ b/fs/nfs/file.c
>> @@ -36,6 +36,7 @@
>>  #include "internal.h"
>>  #include "iostat.h"
>>  #include "fscache.h"
>> +#include "pnfs.h"
>>  
>>  #define NFSDBG_FACILITY		NFSDBG_FILE
>>  
>> @@ -386,6 +387,10 @@ static int nfs_write_begin(struct file *file, struct address_space *mapping,
>>  		file->f_path.dentry->d_name.name,
>>  		mapping->host->i_ino, len, (long long) pos);
>>  
>> +	pnfs_update_layout(mapping->host,
>> +			   nfs_file_open_context(file),
>> +			   IOMODE_RW);
>> +
>>  start:
>>  	/*
>>  	 * Prevent starvation issues if someone is doing a consistency
>> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
>> index 7d2d6c7..0dc6dad 100644
>> --- a/fs/nfs/inode.c
>> +++ b/fs/nfs/inode.c
>> @@ -48,6 +48,7 @@
>>  #include "internal.h"
>>  #include "fscache.h"
>>  #include "dns_resolve.h"
>> +#include "pnfs.h"
>>  
>>  #define NFSDBG_FACILITY		NFSDBG_VFS
>>  
>> @@ -1409,6 +1410,7 @@ void nfs4_evict_inode(struct inode *inode)
>>  {
>>  	truncate_inode_pages(&inode->i_data, 0);
>>  	end_writeback(inode);
>> +	pnfs_destroy_layout(NFS_I(inode));
>>  	/* If we are holding a delegation, return it! */
>>  	nfs_inode_return_delegation_noreclaim(inode);
>>  	/* First call standard NFS clear_inode() code */
>> @@ -1446,6 +1448,7 @@ static inline void nfs4_init_once(struct nfs_inode *nfsi)
>>  	nfsi->delegation = NULL;
>>  	nfsi->delegation_state = 0;
>>  	init_rwsem(&nfsi->rwsem);
>> +	nfsi->layout = NULL;
>>  #endif
>>  }
>>  
>> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>> index 8d503fc..65f923b 100644
>> --- a/fs/nfs/pnfs.c
>> +++ b/fs/nfs/pnfs.c
>> @@ -151,3 +151,143 @@ pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *ld_type)
>>  	spin_unlock(&pnfs_spinlock);
>>  }
>>  EXPORT_SYMBOL(pnfs_unregister_layoutdriver);
>> +
>> +static void
>> +get_layout_hdr_locked(struct pnfs_layout_hdr *lo)
>> +{
>> +	assert_spin_locked(&lo->inode->i_lock);
>> +	lo->refcount++;
>> +}
>> +
>> +static void
>> +put_layout_hdr_locked(struct pnfs_layout_hdr *lo)
>> +{
>> +	assert_spin_locked(&lo->inode->i_lock);
>> +	BUG_ON(lo->refcount <= 0);
>> +
>> +	lo->refcount--;
>> +	if (!lo->refcount) {
>> +		dprintk("%s: freeing layout cache %p\n", __func__, lo);
>> +		NFS_I(lo->inode)->layout = NULL;
>> +		kfree(lo);
>> +	}
>> +}
>> +
>> +void
>> +pnfs_destroy_layout(struct nfs_inode *nfsi)
>> +{
>> +	struct pnfs_layout_hdr *lo;
>> +
>> +	spin_lock(&nfsi->vfs_inode.i_lock);
>> +	lo = nfsi->layout;
>> +	if (lo) {
>> +		/* Matched by refcount set to 1 in alloc_init_layout_hdr */
>> +		put_layout_hdr_locked(lo);
>> +	}
>> +	spin_unlock(&nfsi->vfs_inode.i_lock);
>> +}
>> +
>> +/* STUB - pretend LAYOUTGET to server failed */
>> +static struct pnfs_layout_segment *
>> +send_layoutget(struct pnfs_layout_hdr *lo,
>> +	   struct nfs_open_context *ctx,
>> +	   u32 iomode)
>> +{
>> +	struct inode *ino = lo->inode;
>> +
>> +	set_bit(lo_fail_bit(iomode), &lo->state);
>> +	spin_lock(&ino->i_lock);
>> +	put_layout_hdr_locked(lo);
>> +	spin_unlock(&ino->i_lock);
>> +	return NULL;
>> +}
>> +
>> +static struct pnfs_layout_hdr *
>> +alloc_init_layout_hdr(struct inode *ino)
>> +{
>> +	struct pnfs_layout_hdr *lo;
>> +
>> +	lo = kzalloc(sizeof(struct pnfs_layout_hdr), GFP_KERNEL);
>> +	if (!lo)
>> +		return NULL;
>> +	lo->refcount = 1;
>> +	lo->inode = ino;
>> +	return lo;
>> +}
>> +
>> +static struct pnfs_layout_hdr *
>> +pnfs_find_alloc_layout(struct inode *ino)
>> +{
>> +	struct nfs_inode *nfsi = NFS_I(ino);
>> +	struct pnfs_layout_hdr *new = NULL;
>> +
>> +	dprintk("%s Begin ino=%p layout=%p\n", __func__, ino, nfsi->layout);
>> +
>> +	assert_spin_locked(&ino->i_lock);
>> +	if (nfsi->layout)
>> +		return nfsi->layout;
>> +
>> +	spin_unlock(&ino->i_lock);
>> +	new = alloc_init_layout_hdr(ino);
>> +	spin_lock(&ino->i_lock);
>> +
>> +	if (likely(nfsi->layout == NULL))	/* Won the race? */
>> +		nfsi->layout = new;
>> +	else
>> +		kfree(new);
>> +	return nfsi->layout;
>> +}
>> +
>> +/* STUB - LAYOUTGET never succeeds, so cache is empty */
>> +static struct pnfs_layout_segment *
>> +pnfs_has_layout(struct pnfs_layout_hdr *lo, u32 iomode)
>> +{
>> +	return NULL;
>> +}
>> +
>> +/*
>> + * Layout segment is retrieved from the server if not cached.
>> + * The appropriate layout segment is referenced and returned to the caller.
>> + */
>> +struct pnfs_layout_segment *
>> +pnfs_update_layout(struct inode *ino,
>> +		   struct nfs_open_context *ctx,
>> +		   enum pnfs_iomode iomode)
>> +{
>> +	struct nfs_inode *nfsi = NFS_I(ino);
>> +	struct pnfs_layout_hdr *lo;
>> +	struct pnfs_layout_segment *lseg = NULL;
>> +
>> +	if (!pnfs_enabled_sb(NFS_SERVER(ino)))
>> +		return NULL;
>> +	spin_lock(&ino->i_lock);
>> +	lo = pnfs_find_alloc_layout(ino);
>> +	if (lo == NULL) {
>> +		dprintk("%s ERROR: can't get pnfs_layout_hdr\n", __func__);
>> +		goto out_unlock;
>> +	}
>> +
>> +	/* Check to see if the layout for the given range already exists */
>> +	lseg = pnfs_has_layout(lo, iomode);
>> +	if (lseg) {
>> +		dprintk("%s: Using cached lseg %p for iomode %d)\n",
>> +			__func__, lseg, iomode);
>> +		goto out_unlock;
>> +	}
>> +
>> +	/* if LAYOUTGET already failed once we don't try again */
>> +	if (test_bit(lo_fail_bit(iomode), &nfsi->layout->state))
>> +		goto out_unlock;
>> +
>> +	get_layout_hdr_locked(lo);
>> +	spin_unlock(&ino->i_lock);
>> +
>> +	lseg = send_layoutget(lo, ctx, iomode);
>> +out:
>> +	dprintk("%s end, state 0x%lx lseg %p\n", __func__,
>> +		nfsi->layout->state, lseg);
>> +	return lseg;
>> +out_unlock:
>> +	spin_unlock(&ino->i_lock);
>> +	goto out;
>> +}
>> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
>> index 9049b9a..b63b445 100644
>> --- a/fs/nfs/pnfs.h
>> +++ b/fs/nfs/pnfs.h
>> @@ -14,6 +14,11 @@
>>  
>>  #define LAYOUT_NFSV4_1_MODULE_PREFIX "nfs-layouttype4"
>>  
>> +enum {
>> +	NFS_LAYOUT_RO_FAILED = 0,	/* get ro layout failed stop trying */
>> +	NFS_LAYOUT_RW_FAILED,		/* get rw layout failed stop trying */
>> +};
>> +
>>  /* Per-layout driver specific registration structure */
>>  struct pnfs_layoutdriver_type {
>>  	struct list_head pnfs_tblid;
>> @@ -22,6 +27,12 @@ struct pnfs_layoutdriver_type {
>>  	struct layoutdriver_io_operations *ld_io_ops;
>>  };
>>  
>> +struct pnfs_layout_hdr {
>> +	int			refcount;
>         ^^^^^ Why not make this 'unsigned int', and/or 'unsigned long'?

Should be fine, we just need to be careful about underflow/overflow
before changing its value.

Benny
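
For illustration, a small userspace sketch of the checks an unsigned
counter needs.  The struct and the helpers hdr_get/hdr_put are
hypothetical stand-ins for pnfs_layout_hdr and its get/put routines,
and the i_lock handling is left out.  Note that with an unsigned type
a test like "refcount <= 0" can only ever catch zero, so the underflow
check has to happen before the decrement, not after it.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

struct layout_hdr {
	unsigned int refcount;
};

/* Take a reference; refuse rather than wrap around at UINT_MAX. */
int hdr_get(struct layout_hdr *lo)
{
	assert(lo != NULL);
	if (lo->refcount == UINT_MAX)
		return -EOVERFLOW;
	lo->refcount++;
	return 0;
}

/*
 * Drop a reference.  Returns 1 if this was the last reference (the
 * caller would free the header), 0 if references remain, and -EINVAL
 * if the count was already zero.
 */
int hdr_put(struct layout_hdr *lo)
{
	assert(lo != NULL);
	if (lo->refcount == 0)
		return -EINVAL;
	return --lo->refcount == 0;
}
```

Returning an error instead of hitting BUG_ON() is only a choice made
for this sketch; the point is the order of the check and the update.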

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/13] RFC: pnfs: full mount/umount infrastructure
  2010-09-13 11:24       ` Benny Halevy
@ 2010-09-13 12:29         ` Trond Myklebust
  2010-09-13 14:37           ` Benny Halevy
  2010-09-13 14:28         ` Christoph Hellwig
  1 sibling, 1 reply; 55+ messages in thread
From: Trond Myklebust @ 2010-09-13 12:29 UTC (permalink / raw)
  To: Benny Halevy; +Cc: Christoph Hellwig, Fred Isaman, linux-nfs

On Mon, 2010-09-13 at 13:24 +0200, Benny Halevy wrote:
> On 2010-09-11 03:07, Trond Myklebust wrote:
> > On Fri, 2010-09-10 at 19:58 -0400, Christoph Hellwig wrote:
> >>> +EXPORT_SYMBOL(pnfs_register_layoutdriver);
> >>
> >> All exports from nfs.ko need to be _GPL - this is in no way a public
> >> API, just an internal subdivision of the nfs client.
> > 
> > ACK. We're not committing to supporting a stable ABI here in any way,
> > shape or form...
> 
> OK, not yet.
> 
> However, in the longer run I'd like us to consider formalizing
> the kABI for non-GPLed layout drivers.
> 
> I think that this is a great selling point as it fully materializes the
> extensibility of the layout-type / layout-driver design model.

No. I'm not committing to a kabi, ever...

Trond


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/13] RFC: pnfs: filelayout: introduce minimal file layout driver
  2010-09-13 10:32         ` Benny Halevy
@ 2010-09-13 13:01           ` Fred Isaman
       [not found]             ` <AANLkTimONZfA6ZX4xtzbmy0NdfEtbwMAi+__PhFYznTn-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 55+ messages in thread
From: Fred Isaman @ 2010-09-13 13:01 UTC (permalink / raw)
  To: Benny Halevy; +Cc: Trond Myklebust, linux-nfs

On Mon, Sep 13, 2010 at 3:32 AM, Benny Halevy <bhalevy@panasas.com> wrote:
> On 2010-09-11 01:37, Trond Myklebust wrote:
>> On Fri, 2010-09-10 at 14:11 -0700, Fred Isaman wrote:
>>> On Fri, Sep 10, 2010 at 12:31 PM, Trond Myklebust
>>> <Trond.Myklebust@netapp.com> wrote:
>>>> On Thu, 2010-09-02 at 14:00 -0400, Fred Isaman wrote:
>>> OK
>>>
>>>>> +     .initialize_mountpoint   = filelayout_initialize_mountpoint,
>>>>> +     .uninitialize_mountpoint = filelayout_uninitialize_mountpoint,
>>>>> +};
>>>>> +
>>>>> +
>>>>> +struct pnfs_layoutdriver_type filelayout_type = {
>>>>
>>>> Ditto.
>>>
>>> This includes a list_head field which is set by the generic layer.
>>>
>>>>
>>>>> +     .id = LAYOUT_NFSV4_1_FILES,
>>>>> +     .name = "LAYOUT_NFSV4_1_FILES",
>>>>> +     .ld_io_ops = &filelayout_io_operations,
>>>>
>>>> Why do we need a separate 'struct layoutdriver_io_operations'? Any
>>>> reason those can't just be embedded in struct pnfs_layoutdriver_type?
>>>
>>> I believe this decision was primarily aesthetics.  However, keeping
>>> the static io_ops separate from the variable list_head seems like a
>>> good idea.
>>
>> I dunno. They are in a 1-1 correspondence, so I'm not sure I see the
>> need for a separation.
>>
>
> Later in the game we introduce the layout driver policy ops.
> That said, they could be added to the same vector as the io ops.


Yes, I think they should be merged when we get to that stage.


>
>>> Perhaps having a driver structure that includes the io_ops and static
>>> portions of pnfs_layoutdriver_type, with the generic layer allocating
>>> a wrapper structure that is basically:
>>> struct {
>>>     struct list_head list;
>>>     struct pnfs_layoutdriver_type *driver_info;
>>
>>       Should be const...
>>
>>       struct module *owner = THIS_MODULE;
>>
>>> }
>>
>> ...although the struct module could probably indeed be part of
>> pnfs_layoutdriver_type too.
>
> Agreed.
> I think we should just have
>
> struct pnfs_layoutdriver_type {
>        struct list_head pnfs_tblid;
>        const u32 id;
>        const char *name;
>        const struct module *owner;
>        const struct layoutdriver_operations *ld_ops;
>  };
>
> Benny
>

I'll point out that what I took from the above conversation was that
the fields id, name, and possibly owner should be placed inside struct
layoutdriver_operations (which should probably be renamed slightly).
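
As a rough user-space sketch of that arrangement (not the actual patch):
the identity fields sit beside the operations in one const structure, and
the generic layer keeps its list linkage in a separate wrapper. All names
below (layoutdriver_sketch, driver_entry, register_driver, find_driver)
are hypothetical, the owner field is left out because there is no struct
module outside the kernel, and a plain singly linked list stands in for
list_head.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical merged structure: identity fields live next to the ops. */
struct layoutdriver_sketch {
	uint32_t id;
	const char *name;
	int (*initialize_mountpoint)(void);
	int (*uninitialize_mountpoint)(void);
};

/* Wrapper owned by the generic layer, so the driver data can stay const. */
struct driver_entry {
	struct driver_entry *next;
	const struct layoutdriver_sketch *driver;
};

static struct driver_entry *driver_table;

/* Look a driver up by layout type id; NULL if none is registered. */
static const struct layoutdriver_sketch *find_driver(uint32_t id)
{
	for (struct driver_entry *e = driver_table; e; e = e->next)
		if (e->driver->id == id)
			return e->driver;
	return NULL;
}

/* Returns 0 on success, -1 if the id is already taken.
 * The caller supplies storage for the entry. */
static int register_driver(struct driver_entry *e,
			   const struct layoutdriver_sketch *ld)
{
	if (find_driver(ld->id))
		return -1;
	e->driver = ld;
	e->next = driver_table;
	driver_table = e;
	return 0;
}
```

The point of the split is that only the wrapper is ever written to after
registration, so a driver can declare its structure const.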


>>
>> Cheers
>>   Trond
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/13] RFC: pnfs: add LAYOUTGET and GETDEVICEINFO infrastructure
  2010-09-10 21:47     ` Fred Isaman
  2010-09-10 22:43       ` Trond Myklebust
@ 2010-09-13 14:16       ` Benny Halevy
  1 sibling, 0 replies; 55+ messages in thread
From: Benny Halevy @ 2010-09-13 14:16 UTC (permalink / raw)
  To: Fred Isaman; +Cc: Trond Myklebust, linux-nfs

On 2010-09-11 00:47, Fred Isaman wrote:
> On Fri, Sep 10, 2010 at 1:11 PM, Trond Myklebust
> <Trond.Myklebust@netapp.com> wrote:
>> On Thu, 2010-09-02 at 14:00 -0400, Fred Isaman wrote:
>>> From: The pNFS Team <linux-nfs@vger.kernel.org>
>>>
>>> Add the ability to actually send LAYOUTGET and GETDEVICEINFO.  This also adds
>>> in the machinery to handle layout state and the deviceid cache.  Note that
>>> GETDEVICEINFO is not called directly by the generic layer.  Instead it
>>> is called by the drivers while parsing the LAYOUTGET opaque data in response
>>> to an unknown device id embedded therein.  Annoyingly, RFC 5661 only encodes
>>> device ids within the driver-specific opaque data.
>>>
>>> Signed-off-by: TBD - melding/reorganization of several patches
>>> ---
>>>  fs/nfs/nfs4proc.c         |  134 ++++++++++++++++
>>>  fs/nfs/nfs4xdr.c          |  302 +++++++++++++++++++++++++++++++++++
>>>  fs/nfs/pnfs.c             |  382 ++++++++++++++++++++++++++++++++++++++++++---
>>>  fs/nfs/pnfs.h             |   91 +++++++++++-
>>>  include/linux/nfs4.h      |    2 +
>>>  include/linux/nfs_fs_sb.h |    1 +
>>>  include/linux/nfs_xdr.h   |   49 ++++++
>>>  7 files changed, 935 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
>>> index c7c7277..7eeea0e 100644
>>> --- a/fs/nfs/nfs4proc.c
>>> +++ b/fs/nfs/nfs4proc.c
>>> @@ -55,6 +55,7 @@
>>>  #include "internal.h"
>>>  #include "iostat.h"
>>>  #include "callback.h"
>>> +#include "pnfs.h"
>>>
>>>  #define NFSDBG_FACILITY              NFSDBG_PROC
>>>
>>> @@ -5335,6 +5336,139 @@ out:
>>>       dprintk("<-- %s status=%d\n", __func__, status);
>>>       return status;
>>>  }
>>> +
>>> +static void
>>> +nfs4_layoutget_prepare(struct rpc_task *task, void *calldata)
>>> +{
>>> +     struct nfs4_layoutget *lgp = calldata;
>>> +     struct inode *ino = lgp->args.inode;
>>> +     struct nfs_server *server = NFS_SERVER(ino);
>>> +
>>> +     dprintk("--> %s\n", __func__);
>>> +     if (nfs4_setup_sequence(server, &lgp->args.seq_args,
>>> +                             &lgp->res.seq_res, 0, task))
>>> +             return;
>>> +     rpc_call_start(task);
>>> +}
>>> +
>>> +static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
>>> +{
>>> +     struct nfs4_layoutget *lgp = calldata;
>>> +     struct inode *ino = lgp->args.inode;
>>> +     struct nfs_server *server = NFS_SERVER(ino);
>>> +
>>> +     dprintk("--> %s\n", __func__);
>>> +
>>> +     if (!nfs4_sequence_done(task, &lgp->res.seq_res))
>>> +             return;
>>> +
>>> +     if (RPC_ASSASSINATED(task))
>>> +             return;
>>> +
>>> +     if (nfs4_async_handle_error(task, server, NULL) == -EAGAIN)
>>> +             nfs_restart_rpc(task, server->nfs_client);
>>> +
>>> +     lgp->status = task->tk_status;
>>> +     dprintk("<-- %s\n", __func__);
>>> +}
>>> +
>>> +static void nfs4_layoutget_release(void *calldata)
>>> +{
>>> +     struct nfs4_layoutget *lgp = calldata;
>>> +
>>> +     dprintk("--> %s\n", __func__);
>>> +     put_layout_hdr(lgp->args.inode);
>>> +     if (lgp->res.layout.buf != NULL)
>>> +             free_page((unsigned long) lgp->res.layout.buf);
>>> +     put_nfs_open_context(lgp->args.ctx);
>>> +     kfree(calldata);
>>> +     dprintk("<-- %s\n", __func__);
>>> +}
>>> +
>>> +static const struct rpc_call_ops nfs4_layoutget_call_ops = {
>>> +     .rpc_call_prepare = nfs4_layoutget_prepare,
>>> +     .rpc_call_done = nfs4_layoutget_done,
>>> +     .rpc_release = nfs4_layoutget_release,
>>> +};
>>> +
>>> +static int _nfs4_proc_layoutget(struct nfs4_layoutget *lgp)
>>> +{
>>> +     struct nfs_server *server = NFS_SERVER(lgp->args.inode);
>>> +     struct rpc_task *task;
>>> +     struct rpc_message msg = {
>>> +             .rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_LAYOUTGET],
>>> +             .rpc_argp = &lgp->args,
>>> +             .rpc_resp = &lgp->res,
>>> +     };
>>> +     struct rpc_task_setup task_setup_data = {
>>> +             .rpc_client = server->client,
>>> +             .rpc_message = &msg,
>>> +             .callback_ops = &nfs4_layoutget_call_ops,
>>> +             .callback_data = lgp,
>>> +             .flags = RPC_TASK_ASYNC,
>>> +     };
>>> +     int status = 0;
>>> +
>>> +     dprintk("--> %s\n", __func__);
>>> +
>>> +     lgp->res.layout.buf = (void *)__get_free_page(GFP_NOFS);
>>> +     if (lgp->res.layout.buf == NULL) {
>>> +             nfs4_layoutget_release(lgp);
>>> +             return -ENOMEM;
>>> +     }
>>> +
>>> +     lgp->res.seq_res.sr_slotid = NFS4_MAX_SLOT_TABLE;
>>> +     task = rpc_run_task(&task_setup_data);
>>> +     if (IS_ERR(task))
>>> +             return PTR_ERR(task);
>>> +     status = nfs4_wait_for_completion_rpc_task(task);
>>> +     if (status != 0)
>>> +             goto out;
>>> +     status = lgp->status;
>>> +     if (status != 0)
>>> +             goto out;
>>> +     status = pnfs_layout_process(lgp);
>>> +out:
>>> +     rpc_put_task(task);
>>> +     dprintk("<-- %s status=%d\n", __func__, status);
>>> +     return status;
>>> +}
>>> +
>>> +int nfs4_proc_layoutget(struct nfs4_layoutget *lgp)
>>> +{
>>> +     struct nfs_server *server = NFS_SERVER(lgp->args.inode);
>>> +     struct nfs4_exception exception = { };
>>> +     int err;
>>> +     do {
>>> +             err = nfs4_handle_exception(server, _nfs4_proc_layoutget(lgp),
>>> +                                         &exception);
>>> +     } while (exception.retry);
>>> +     return err;
>>> +}
>>
>> Since nfs4_layoutget_done() already calls nfs4_async_handle_error(), do
>> you really need to call nfs4_handle_exception()?
>>
> 
> 
> Hmmm, since it is being called synchronously at the moment, we should
> probably remove the nfs4_async_handle_error call.
> 

Agreed.  Just leave the exception handling here.
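
To illustrate the shape being agreed on, here is a small user-space
sketch in which the synchronous caller is the only place that interprets
errors and decides whether to retry. Everything in it is hypothetical
(SKETCH_EDELAY, exception_sketch, handle_exception_sketch, the retry
limit of three); it mirrors the do/while loop in nfs4_proc_layoutget but
is not the kernel's nfs4_handle_exception.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_EDELAY (-1)	/* hypothetical "try again later" status */

struct exception_sketch {
	bool retry;
	int attempts;
};

/* Hypothetical stand-in for the exception handler: decide whether to retry. */
static int handle_exception_sketch(int status, struct exception_sketch *ex)
{
	ex->retry = false;
	if (status == SKETCH_EDELAY && ex->attempts < 3) {
		ex->attempts++;
		ex->retry = true;
		return 0;
	}
	return status;
}

/* Synchronous caller: the single place where errors are interpreted. */
static int call_with_retry(int (*op)(void *), void *arg)
{
	struct exception_sketch ex = { 0 };
	int err;

	do {
		err = handle_exception_sketch(op(arg), &ex);
	} while (ex.retry);
	return err;
}

/* Example operation: fails with SKETCH_EDELAY until the counter runs out. */
static int flaky_op(void *arg)
{
	int *remaining_failures = arg;

	if (*remaining_failures > 0) {
		(*remaining_failures)--;
		return SKETCH_EDELAY;
	}
	return 0;
}
```

If the completion callback also tried to recover from the same error,
each failure would be handled twice, which is the redundancy being
removed above.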

> 
>>> +
>>> +int nfs4_proc_getdeviceinfo(struct nfs_server *server, struct pnfs_device *pdev)
>>> +{
>>> +     struct nfs4_getdeviceinfo_args args = {
>>> +             .pdev = pdev,
>>> +     };
>>> +     struct nfs4_getdeviceinfo_res res = {
>>> +             .pdev = pdev,
>>> +     };
>>> +     struct rpc_message msg = {
>>> +             .rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_GETDEVICEINFO],
>>> +             .rpc_argp = &args,
>>> +             .rpc_resp = &res,
>>> +     };
>>> +     int status;
>>> +
>>> +     dprintk("--> %s\n", __func__);
>>> +     status = nfs4_call_sync(server, &msg, &args, &res, 0);
>>> +     dprintk("<-- %s status=%d\n", __func__, status);
>>> +
>>> +     return status;
>>> +}
>>> +EXPORT_SYMBOL_GPL(nfs4_proc_getdeviceinfo);
>>> +
>>
>> This, on the other hand, might need a 'handle exception' wrapper.
> 
> I agree.
> 
> 
>>
>>>  #endif /* CONFIG_NFS_V4_1 */
>>>
>>>  struct nfs4_state_recovery_ops nfs40_reboot_recovery_ops = {
>>> diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
>>> index 60233ae..aaf6fe5 100644
>>> --- a/fs/nfs/nfs4xdr.c
>>> +++ b/fs/nfs/nfs4xdr.c
>>> @@ -52,6 +52,7 @@
>>>  #include <linux/nfs_idmap.h>
>>>  #include "nfs4_fs.h"
>>>  #include "internal.h"
>>> +#include "pnfs.h"
>>>
>>>  #define NFSDBG_FACILITY              NFSDBG_XDR
>>>
>>> @@ -310,6 +311,19 @@ static int nfs4_stat_to_errno(int);
>>>                               XDR_QUADLEN(NFS4_MAX_SESSIONID_LEN) + 5)
>>>  #define encode_reclaim_complete_maxsz        (op_encode_hdr_maxsz + 4)
>>>  #define decode_reclaim_complete_maxsz        (op_decode_hdr_maxsz + 4)
>>> +#define encode_getdeviceinfo_maxsz (op_encode_hdr_maxsz + 4 + \
>>> +                             XDR_QUADLEN(NFS4_PNFS_DEVICEID4_SIZE))
>>> +#define decode_getdeviceinfo_maxsz (op_decode_hdr_maxsz + \
>>> +                             1 /* layout type */ + \
>>> +                             1 /* opaque devaddr4 length */ + \
>>> +                               /* devaddr4 payload is read into page */ \
>>> +                             1 /* notification bitmap length */ + \
>>> +                             1 /* notification bitmap */)
>>> +#define encode_layoutget_maxsz       (op_encode_hdr_maxsz + 10 + \
>>> +                             encode_stateid_maxsz)
>>> +#define decode_layoutget_maxsz       (op_decode_hdr_maxsz + 8 + \
>>> +                             decode_stateid_maxsz + \
>>> +                             XDR_QUADLEN(PNFS_LAYOUT_MAXSIZE))
>>>  #else /* CONFIG_NFS_V4_1 */
>>>  #define encode_sequence_maxsz        0
>>>  #define decode_sequence_maxsz        0
>>> @@ -699,6 +713,20 @@ static int nfs4_stat_to_errno(int);
>>>  #define NFS4_dec_reclaim_complete_sz (compound_decode_hdr_maxsz + \
>>>                                        decode_sequence_maxsz + \
>>>                                        decode_reclaim_complete_maxsz)
>>> +#define NFS4_enc_getdeviceinfo_sz (compound_encode_hdr_maxsz +    \
>>> +                             encode_sequence_maxsz +\
>>> +                             encode_getdeviceinfo_maxsz)
>>> +#define NFS4_dec_getdeviceinfo_sz (compound_decode_hdr_maxsz +    \
>>> +                             decode_sequence_maxsz + \
>>> +                             decode_getdeviceinfo_maxsz)
>>> +#define NFS4_enc_layoutget_sz        (compound_encode_hdr_maxsz + \
>>> +                             encode_sequence_maxsz + \
>>> +                             encode_putfh_maxsz +        \
>>> +                             encode_layoutget_maxsz)
>>> +#define NFS4_dec_layoutget_sz        (compound_decode_hdr_maxsz + \
>>> +                             decode_sequence_maxsz + \
>>> +                             decode_putfh_maxsz +        \
>>> +                             decode_layoutget_maxsz)
>>>
>>>  const u32 nfs41_maxwrite_overhead = ((RPC_MAX_HEADER_WITH_AUTH +
>>>                                     compound_encode_hdr_maxsz +
>>> @@ -1726,6 +1754,61 @@ static void encode_sequence(struct xdr_stream *xdr,
>>>  #endif /* CONFIG_NFS_V4_1 */
>>>  }
>>>
>>> +#ifdef CONFIG_NFS_V4_1
>>> +static void
>>> +encode_getdeviceinfo(struct xdr_stream *xdr,
>>> +                  const struct nfs4_getdeviceinfo_args *args,
>>> +                  struct compound_hdr *hdr)
>>> +{
>>> +     int has_bitmap = (args->pdev->dev_notify_types != 0);
>>> +     int len = 16 + NFS4_PNFS_DEVICEID4_SIZE + (has_bitmap * 4);
>>> +     __be32 *p;
>>> +
>>> +     p = reserve_space(xdr, len);
>>> +     *p++ = cpu_to_be32(OP_GETDEVICEINFO);
>>> +     p = xdr_encode_opaque_fixed(p, args->pdev->dev_id.data,
>>> +                                 NFS4_PNFS_DEVICEID4_SIZE);
>>> +     *p++ = cpu_to_be32(args->pdev->layout_type);
>>> +     *p++ = cpu_to_be32(args->pdev->pglen);          /* gdia_maxcount */
>>> +     *p++ = cpu_to_be32(has_bitmap);                 /* bitmap length [01] */
>>> +     if (has_bitmap)
>>> +             *p = cpu_to_be32(args->pdev->dev_notify_types);
>>
>> We don't support notification callbacks yet.
>>
> 
> OK, I'll rip this out and just set the bitmap to zero.
> 
>>> +     hdr->nops++;
>>> +     hdr->replen += decode_getdeviceinfo_maxsz;
>>> +}
>>> +
>>> +static void
>>> +encode_layoutget(struct xdr_stream *xdr,
>>> +                   const struct nfs4_layoutget_args *args,
>>> +                   struct compound_hdr *hdr)
>>> +{
>>> +     nfs4_stateid stateid;
>>> +     __be32 *p;
>>> +
>>> +     p = reserve_space(xdr, 44 + NFS4_STATEID_SIZE);
>>> +     *p++ = cpu_to_be32(OP_LAYOUTGET);
>>> +     *p++ = cpu_to_be32(0);     /* Signal layout available */
>>> +     *p++ = cpu_to_be32(args->type);
>>> +     *p++ = cpu_to_be32(args->range.iomode);
>>> +     p = xdr_encode_hyper(p, args->range.offset);
>>> +     p = xdr_encode_hyper(p, args->range.length);
>>> +     p = xdr_encode_hyper(p, args->minlength);
>>> +     pnfs_get_layout_stateid(&stateid, NFS_I(args->inode)->layout);
>>> +     p = xdr_encode_opaque_fixed(p, &stateid.data, NFS4_STATEID_SIZE);
>>> +     *p = cpu_to_be32(args->maxcount);
>>> +
>>> +     dprintk("%s: 1st type:0x%x iomode:%d off:%lu len:%lu mc:%d\n",
>>> +             __func__,
>>> +             args->type,
>>> +             args->range.iomode,
>>> +             (unsigned long)args->range.offset,
>>> +             (unsigned long)args->range.length,
>>> +             args->maxcount);
>>> +     hdr->nops++;
>>> +     hdr->replen += decode_layoutget_maxsz;
>>> +}
>>> +#endif /* CONFIG_NFS_V4_1 */
>>> +
>>>  /*
>>>   * END OF "GENERIC" ENCODE ROUTINES.
>>>   */
>>> @@ -2543,6 +2626,51 @@ static int nfs4_xdr_enc_reclaim_complete(struct rpc_rqst *req, uint32_t *p,
>>>       return 0;
>>>  }
>>>
>>> +/*
>>> + * Encode GETDEVICEINFO request
>>> + */
>>> +static int nfs4_xdr_enc_getdeviceinfo(struct rpc_rqst *req, uint32_t *p,
>>> +                                   struct nfs4_getdeviceinfo_args *args)
>>> +{
>>> +     struct xdr_stream xdr;
>>> +     struct compound_hdr hdr = {
>>> +             .minorversion = nfs4_xdr_minorversion(&args->seq_args),
>>> +     };
>>> +
>>> +     xdr_init_encode(&xdr, &req->rq_snd_buf, p);
>>> +     encode_compound_hdr(&xdr, req, &hdr);
>>> +     encode_sequence(&xdr, &args->seq_args, &hdr);
>>> +     encode_getdeviceinfo(&xdr, args, &hdr);
>>> +
>>> +     /* set up reply kvec. Subtract notification bitmap max size (2)
>>> +      * so that notification bitmap is put in xdr_buf tail */
>>> +     xdr_inline_pages(&req->rq_rcv_buf, (hdr.replen - 2) << 2,
>>> +                      args->pdev->pages, args->pdev->pgbase,
>>> +                      args->pdev->pglen);
>>> +
>>> +     encode_nops(&hdr);
>>> +     return 0;
>>> +}
>>> +
>>> +/*
>>> + *  Encode LAYOUTGET request
>>> + */
>>> +static int nfs4_xdr_enc_layoutget(struct rpc_rqst *req, uint32_t *p,
>>> +                               struct nfs4_layoutget_args *args)
>>> +{
>>> +     struct xdr_stream xdr;
>>> +     struct compound_hdr hdr = {
>>> +             .minorversion = nfs4_xdr_minorversion(&args->seq_args),
>>> +     };
>>> +
>>> +     xdr_init_encode(&xdr, &req->rq_snd_buf, p);
>>> +     encode_compound_hdr(&xdr, req, &hdr);
>>> +     encode_sequence(&xdr, &args->seq_args, &hdr);
>>> +     encode_putfh(&xdr, NFS_FH(args->inode), &hdr);
>>> +     encode_layoutget(&xdr, args, &hdr);
>>> +     encode_nops(&hdr);
>>> +     return 0;
>>> +}
>>>  #endif /* CONFIG_NFS_V4_1 */
>>>
>>>  static void print_overflow_msg(const char *func, const struct xdr_stream *xdr)
>>> @@ -4788,6 +4916,131 @@ out_overflow:
>>>  #endif /* CONFIG_NFS_V4_1 */
>>>  }
>>>
>>> +#if defined(CONFIG_NFS_V4_1)
>>> +
>>> +static int decode_getdeviceinfo(struct xdr_stream *xdr,
>>> +                             struct pnfs_device *pdev)
>>> +{
>>> +     __be32 *p;
>>> +     uint32_t len, type;
>>> +     int status;
>>> +
>>> +     status = decode_op_hdr(xdr, OP_GETDEVICEINFO);
>>> +     if (status) {
>>> +             if (status == -ETOOSMALL) {
>>> +                     p = xdr_inline_decode(xdr, 4);
>>> +                     if (unlikely(!p))
>>> +                             goto out_overflow;
>>> +                     pdev->mincount = be32_to_cpup(p);
>>> +                     dprintk("%s: Min count too small. mincnt = %u\n",
>>> +                             __func__, pdev->mincount);
>>> +             }
>>> +             return status;
>>> +     }
>>> +
>>> +     p = xdr_inline_decode(xdr, 8);
>>> +     if (unlikely(!p))
>>> +             goto out_overflow;
>>> +     type = be32_to_cpup(p++);
>>> +     if (type != pdev->layout_type) {
>>> +             dprintk("%s: layout mismatch req: %u pdev: %u\n",
>>> +                     __func__, pdev->layout_type, type);
>>> +             return -EINVAL;
>>> +     }
>>> +     /*
>>> +      * Get the length of the opaque device_addr4. xdr_read_pages places
>>> +      * the opaque device_addr4 in the xdr_buf->pages (pnfs_device->pages)
>>> +      * and places the remaining xdr data in xdr_buf->tail
>>> +      */
>>> +     pdev->mincount = be32_to_cpup(p);
>>> +     xdr_read_pages(xdr, pdev->mincount); /* include space for the length */
>>> +
>>> +     /*
>>> +      * At most one bitmap word. If the server returns a bitmap of more
>>> +      * than one word we ignore the extra invalid words given that
>>> +      * getdeviceinfo is the final operation in the compound.
>>> +      */
>>> +     p = xdr_inline_decode(xdr, 4);
>>> +     if (unlikely(!p))
>>> +             goto out_overflow;
>>> +     len = be32_to_cpup(p);
>>> +     if (len) {
>>> +             p = xdr_inline_decode(xdr, 4);
>>> +             if (unlikely(!p))
>>> +                     goto out_overflow;
>>> +             pdev->dev_notify_types = be32_to_cpup(p);
>>> +     } else
>>> +             pdev->dev_notify_types = 0;
>>
>> Again, we don't support notifications.
>>
> 
> OK.
> 
> 
>>> +     return 0;
>>> +out_overflow:
>>> +     print_overflow_msg(__func__, xdr);
>>> +     return -EIO;
>>> +}
>>> +
>>> +static int decode_layoutget(struct xdr_stream *xdr, struct rpc_rqst *req,
>>> +                         struct nfs4_layoutget_res *res)
>>> +{
>>> +     __be32 *p;
>>> +     int status;
>>> +     u32 layout_count;
>>> +
>>> +     status = decode_op_hdr(xdr, OP_LAYOUTGET);
>>> +     if (status)
>>> +             return status;
>>> +     p = xdr_inline_decode(xdr, 8 + NFS4_STATEID_SIZE);
>>> +     if (unlikely(!p))
>>> +             goto out_overflow;
>>> +     res->return_on_close = be32_to_cpup(p++);
>>> +     p = xdr_decode_opaque_fixed(p, res->stateid.data, NFS4_STATEID_SIZE);
>>> +     layout_count = be32_to_cpup(p);
>>> +     if (!layout_count) {
>>> +             dprintk("%s: server responded with empty layout array\n",
>>> +                     __func__);
>>> +             return -EINVAL;
>>> +     }
>>> +
>>> +     p = xdr_inline_decode(xdr, 24);
>>> +     if (unlikely(!p))
>>> +             goto out_overflow;
>>> +     p = xdr_decode_hyper(p, &res->range.offset);
>>> +     p = xdr_decode_hyper(p, &res->range.length);
>>> +     res->range.iomode = be32_to_cpup(p++);
>>> +     res->type = be32_to_cpup(p++);
>>> +
>>> +     status = decode_opaque_inline(xdr, &res->layout.len, (char **)&p);
>>> +     if (unlikely(status))
>>> +             return status;
>>> +
>>> +     dprintk("%s roff:%lu rlen:%lu riomode:%d, lo_type:0x%x, lo.len:%d\n",
>>> +             __func__,
>>> +             (unsigned long)res->range.offset,
>>> +             (unsigned long)res->range.length,
>>> +             res->range.iomode,
>>> +             res->type,
>>> +             res->layout.len);
>>> +
>>> +     /* nfs4_proc_layoutget allocated a single page */
>>> +     if (res->layout.len > PAGE_SIZE)
>>> +             return -ENOMEM;
>>> +     memcpy(res->layout.buf, p, res->layout.len);
>>> +
>>> +     if (layout_count > 1) {
>>> +             /* We only handle a length one array at the moment.  Any
>>> +              * further entries are just ignored.  Note that this means
>>> +              * the client may see a response that is less than the
>>> +              * minimum it requested.
>>> +              */
>>> +             dprintk("%s: server responded with %d layouts, dropping tail\n",
>>> +                     __func__, layout_count);
>>> +     }
>>> +
>>> +     return 0;
>>> +out_overflow:
>>> +     print_overflow_msg(__func__, xdr);
>>> +     return -EIO;
>>> +}
>>> +#endif /* CONFIG_NFS_V4_1 */
>>> +
>>>  /*
>>>   * END OF "GENERIC" DECODE ROUTINES.
>>>   */
>>> @@ -5815,6 +6068,53 @@ static int nfs4_xdr_dec_reclaim_complete(struct rpc_rqst *rqstp, uint32_t *p,
>>>               status = decode_reclaim_complete(&xdr, (void *)NULL);
>>>       return status;
>>>  }
>>> +
>>> +/*
>>> + * Decode GETDEVICEINFO response
>>> + */
>>> +static int nfs4_xdr_dec_getdeviceinfo(struct rpc_rqst *rqstp, uint32_t *p,
>>> +                                   struct nfs4_getdeviceinfo_res *res)
>>> +{
>>> +     struct xdr_stream xdr;
>>> +     struct compound_hdr hdr;
>>> +     int status;
>>> +
>>> +     xdr_init_decode(&xdr, &rqstp->rq_rcv_buf, p);
>>> +     status = decode_compound_hdr(&xdr, &hdr);
>>> +     if (status != 0)
>>> +             goto out;
>>> +     status = decode_sequence(&xdr, &res->seq_res, rqstp);
>>> +     if (status != 0)
>>> +             goto out;
>>> +     status = decode_getdeviceinfo(&xdr, res->pdev);
>>> +out:
>>> +     return status;
>>> +}
>>> +
>>> +/*
>>> + * Decode LAYOUTGET response
>>> + */
>>> +static int nfs4_xdr_dec_layoutget(struct rpc_rqst *rqstp, uint32_t *p,
>>> +                               struct nfs4_layoutget_res *res)
>>> +{
>>> +     struct xdr_stream xdr;
>>> +     struct compound_hdr hdr;
>>> +     int status;
>>> +
>>> +     xdr_init_decode(&xdr, &rqstp->rq_rcv_buf, p);
>>> +     status = decode_compound_hdr(&xdr, &hdr);
>>> +     if (status)
>>> +             goto out;
>>> +     status = decode_sequence(&xdr, &res->seq_res, rqstp);
>>> +     if (status)
>>> +             goto out;
>>> +     status = decode_putfh(&xdr);
>>> +     if (status)
>>> +             goto out;
>>> +     status = decode_layoutget(&xdr, rqstp, res);
>>> +out:
>>> +     return status;
>>> +}
>>>  #endif /* CONFIG_NFS_V4_1 */
>>>
>>>  __be32 *nfs4_decode_dirent(__be32 *p, struct nfs_entry *entry, int plus)
>>> @@ -5993,6 +6293,8 @@ struct rpc_procinfo     nfs4_procedures[] = {
>>>    PROC(SEQUENCE,     enc_sequence,   dec_sequence),
>>>    PROC(GET_LEASE_TIME,       enc_get_lease_time,     dec_get_lease_time),
>>>    PROC(RECLAIM_COMPLETE, enc_reclaim_complete,  dec_reclaim_complete),
>>> +  PROC(GETDEVICEINFO, enc_getdeviceinfo, dec_getdeviceinfo),
>>> +  PROC(LAYOUTGET,  enc_layoutget,     dec_layoutget),
>>>  #endif /* CONFIG_NFS_V4_1 */
>>>  };
>>>
>>> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>>> index cbce942..faf6c4c 100644
>>> --- a/fs/nfs/pnfs.c
>>> +++ b/fs/nfs/pnfs.c
>>> @@ -128,6 +128,12 @@ pnfs_register_layoutdriver(struct pnfs_layoutdriver_type *ld_type)
>>>               return status;
>>>       }
>>>
>>> +     if (!io_ops->alloc_lseg || !io_ops->free_lseg) {
>>> +             printk(KERN_ERR "%s Layout driver must provide "
>>> +                    "alloc_lseg and free_lseg.\n", __func__);
>>> +             return status;
>>> +     }
>>> +
>>>       spin_lock(&pnfs_spinlock);
>>>       if (!find_pnfs_driver_locked(ld_type->id)) {
>>>               list_add(&ld_type->pnfs_tblid, &pnfs_modules_tbl);
>>> @@ -153,6 +159,10 @@ pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *ld_type)
>>>  }
>>>  EXPORT_SYMBOL(pnfs_unregister_layoutdriver);
>>>
>>> +/*
>>> + * pNFS client layout cache
>>> + */
>>> +
>>>  static void
>>>  get_layout_hdr_locked(struct pnfs_layout_hdr *lo)
>>>  {
>>> @@ -175,6 +185,15 @@ put_layout_hdr_locked(struct pnfs_layout_hdr *lo)
>>>       }
>>>  }
>>>
>>> +void
>>> +put_layout_hdr(struct inode *inode)
>>> +{
>>> +     spin_lock(&inode->i_lock);
>>> +     put_layout_hdr_locked(NFS_I(inode)->layout);
>>> +     spin_unlock(&inode->i_lock);
>>> +
>>> +}
>>> +
>>>  static void
>>>  init_lseg(struct pnfs_layout_hdr *lo, struct pnfs_layout_segment *lseg)
>>>  {
>>> @@ -191,7 +210,7 @@ destroy_lseg(struct kref *kref)
>>>       struct pnfs_layout_hdr *local = lseg->layout;
>>>
>>>       dprintk("--> %s\n", __func__);
>>> -     kfree(lseg);
>>> +     PNFS_LD_IO_OPS(local)->free_lseg(lseg);
>>
>> Where is PNFS_LD_IO_OPS() defined? Besides, I thought we agreed to get
>> rid of that.
> 
> This is defined in pnfs.h as
> PNFS_NFS_SERVER()->pnfs_curr_ld->ld_io_ops, mainly to save typing.
> 
> The macro that you had objected to was PNFS_EXISTS_LDIO_OP from
> Benny's tree, which is now gone.
> 
>>
>>>       /* Matched by get_layout_hdr_locked in pnfs_insert_layout */
>>>       put_layout_hdr_locked(local);
>>>  }
>>> @@ -226,6 +245,7 @@ pnfs_clear_lseg_list(struct pnfs_layout_hdr *lo)
>>>       /* List does not take a reference, so no need for put here */
>>>       list_del_init(&lo->layouts);
>>>       spin_unlock(&clp->cl_lock);
>>> +     pnfs_set_layout_stateid(lo, &zero_stateid);
>>>
>>>       dprintk("%s:Return\n", __func__);
>>>  }
>>> @@ -268,40 +288,120 @@ pnfs_destroy_all_layouts(struct nfs_client *clp)
>>>       }
>>>  }
>>>
>>> -static void pnfs_insert_layout(struct pnfs_layout_hdr *lo,
>>> -                            struct pnfs_layout_segment *lseg);
>>> +void
>>> +pnfs_set_layout_stateid(struct pnfs_layout_hdr *lo,
>>> +                     const nfs4_stateid *stateid)
>>> +{
>>> +     write_seqlock(&lo->seqlock);
>>> +     memcpy(lo->stateid.data, stateid->data, sizeof(lo->stateid.data));
>>> +     write_sequnlock(&lo->seqlock);
>>> +}
>>> +
>>> +void
>>> +pnfs_get_layout_stateid(nfs4_stateid *dst, struct pnfs_layout_hdr *lo)
>>> +{
>>> +     int seq;
>>>
>>> -/* Get layout from server. */
>>> +     dprintk("--> %s\n", __func__);
>>> +
>>> +     do {
>>> +             seq = read_seqbegin(&lo->seqlock);
>>> +             memcpy(dst->data, lo->stateid.data,
>>> +                    sizeof(lo->stateid.data));
>>> +     } while (read_seqretry(&lo->seqlock, seq));
>>> +
>>> +     dprintk("<-- %s\n", __func__);
>>> +}
>>> +
>>> +static void
>>> +pnfs_layout_from_open_stateid(struct pnfs_layout_hdr *lo,
>>> +                           struct nfs4_state *state)
>>> +{
>>> +     int seq;
>>> +
>>> +     dprintk("--> %s\n", __func__);
>>> +
>>> +     write_seqlock(&lo->seqlock);
>>> +     /* Zero stateid, which is illegal to use in layout, is our
>>> +      * marker for an un-initialized stateid.
>>> +      */
>>
>> Isn't it easier just to have a flag in the layout?
>>

It's possible.

>>> +     if (!memcmp(lo->stateid.data, &zero_stateid, NFS4_STATEID_SIZE))
>>> +             do {
>>> +                     seq = read_seqbegin(&state->seqlock);
>>> +                     memcpy(lo->stateid.data, state->stateid.data,
>>> +                                     sizeof(state->stateid.data));
>>> +             } while (read_seqretry(&state->seqlock, seq));
>>> +     write_sequnlock(&lo->seqlock);
>>
>> ...and if memcmp(), is the caller supposed to detect that nothing was
>> done?
>>

Not sure I understand your question...
You mean in case state->stateid.data is zero as well?

>>> +     dprintk("<-- %s\n", __func__);
>>> +}
>>> +
>>> +/*
>>> +* Get layout from server.
>>> +*    for now, assume that whole file layouts are requested.
>>> +*    arg->offset: 0
>>> +*    arg->length: all ones
>>> +*/
>>>  static struct pnfs_layout_segment *
>>>  send_layoutget(struct pnfs_layout_hdr *lo,
>>>          struct nfs_open_context *ctx,
>>>          u32 iomode)
>>>  {
>>>       struct inode *ino = lo->inode;
>>> -     struct pnfs_layout_segment *lseg;
>>> +     struct nfs_server *server = NFS_SERVER(ino);
>>> +     struct nfs4_layoutget *lgp;
>>> +     struct pnfs_layout_segment *lseg = NULL;
>>>
>>> -     /* Lets pretend we sent LAYOUTGET and got a response */
>>> -     lseg = kzalloc(sizeof(*lseg), GFP_KERNEL);
>>> +     dprintk("--> %s\n", __func__);
>>> +
>>> +     BUG_ON(ctx == NULL);
>>> +     lgp = kzalloc(sizeof(*lgp), GFP_KERNEL);
>>> +     if (lgp == NULL) {
>>> +             put_layout_hdr(lo->inode);
>>> +             return NULL;
>>> +     }
>>> +     lgp->args.minlength = NFS4_MAX_UINT64;
>>> +     lgp->args.maxcount = PNFS_LAYOUT_MAXSIZE;
>>> +     lgp->args.range.iomode = iomode;
>>> +     lgp->args.range.offset = 0;
>>> +     lgp->args.range.length = NFS4_MAX_UINT64;
>>> +     lgp->args.type = server->pnfs_curr_ld->id;
>>> +     lgp->args.inode = ino;
>>> +     lgp->args.ctx = get_nfs_open_context(ctx);
>>> +     lgp->lsegpp = &lseg;
>>> +
>>> +     if (!memcmp(lo->stateid.data, &zero_stateid, NFS4_STATEID_SIZE))
>>> +             pnfs_layout_from_open_stateid(NFS_I(ino)->layout, ctx->state);
>>
>> Why do an extra memcmp() here?

Yeah, actually the one in pnfs_layout_from_open_stateid() can be removed,
or it can be open coded here since this is the only call site.
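
One possible shape for that cleanup, combined with the flag suggested
earlier, as a user-space sketch only. The names layout_sketch,
stateid_set and layout_seed_stateid are hypothetical, and the seqlock
protection from the patch is omitted. A single check replaces both
memcmp() calls, and the return value tells the caller whether anything
was copied.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define STATEID_SIZE_SKETCH 16

struct stateid_sketch {
	unsigned char data[STATEID_SIZE_SKETCH];
};

struct layout_sketch {
	struct stateid_sketch stateid;
	bool stateid_set;	/* hypothetical flag replacing the zero-stateid marker */
};

/* Seed the layout stateid from the open stateid once.
 * Returns true if it copied, false if the layout already had one. */
static bool layout_seed_stateid(struct layout_sketch *lo,
				const struct stateid_sketch *open_stateid)
{
	if (lo->stateid_set)
		return false;
	memcpy(lo->stateid.data, open_stateid->data, sizeof(lo->stateid.data));
	lo->stateid_set = true;
	return true;
}
```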

Benny

> 
> OK, clearly the function and call to pnfs_layout_from_open_stateid
> need to be reexamined.
> 
> Fred
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/13] RFC: pnfs: filelayout: introduce minimal file layout driver
       [not found]             ` <AANLkTimONZfA6ZX4xtzbmy0NdfEtbwMAi+__PhFYznTn-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-09-13 14:23               ` Benny Halevy
  0 siblings, 0 replies; 55+ messages in thread
From: Benny Halevy @ 2010-09-13 14:23 UTC (permalink / raw)
  To: Fred Isaman; +Cc: Trond Myklebust, linux-nfs

On 2010-09-13 15:01, Fred Isaman wrote:
> On Mon, Sep 13, 2010 at 3:32 AM, Benny Halevy <bhalevy@panasas.com> wrote:
>> On 2010-09-11 01:37, Trond Myklebust wrote:
>>> On Fri, 2010-09-10 at 14:11 -0700, Fred Isaman wrote:
>>>> On Fri, Sep 10, 2010 at 12:31 PM, Trond Myklebust
>>>> <Trond.Myklebust@netapp.com> wrote:
>>>>> On Thu, 2010-09-02 at 14:00 -0400, Fred Isaman wrote:
>>>> OK
>>>>
>>>>>> +     .initialize_mountpoint   = filelayout_initialize_mountpoint,
>>>>>> +     .uninitialize_mountpoint = filelayout_uninitialize_mountpoint,
>>>>>> +};
>>>>>> +
>>>>>> +
>>>>>> +struct pnfs_layoutdriver_type filelayout_type = {
>>>>>
>>>>> Ditto.
>>>>
>>>> This includes a list_head field which is set by the generic layer.
>>>>
>>>>>
>>>>>> +     .id = LAYOUT_NFSV4_1_FILES,
>>>>>> +     .name = "LAYOUT_NFSV4_1_FILES",
>>>>>> +     .ld_io_ops = &filelayout_io_operations,
>>>>>
>>>>> Why do we need a separate 'struct layoutdriver_io_operations'? Any
>>>>> reason those can't just be embedded in struct pnfs_layoutdriver_type?
>>>>
>>>> I believe this decision was primarily aesthetics.  However, keeping
>>>> the static io_ops separate from the variable list_head seems like a
>>>> good idea.
>>>
>>> I dunno. They are in a 1-1 correspondence, so I'm not sure I see the
>>> need for a separation.
>>>
>>
>> Later in the game we introduce the layout driver policy ops.
>> That said, they could be added to the same vector as the io ops.
> 
> 
> Yes, I think they should be merged when we get to that stage.
> 
> 
>>
>>>> Perhaps having a driver structure that includes the io_ops and static
>>>> portions of pnfs_layoutdriver_type, with the generic layer allocating
>>>> a wrapper structure that is basically:
>>>> struct {
>>>>     struct list_head list;
>>>>     struct pnfs_layoutdriver_type *driver_info;
>>>
>>>       Should be const...
>>>
>>>       struct module *owner = THIS_MODULE;
>>>
>>>> }
>>>
>>> ...although the struct module could probably indeed be part of
>>> pnfs_layoutdriver_type too.
>>
>> Agreed.
>> I think we should just have
>>
>> struct pnfs_layoutdriver_type {
>>        struct list_head pnfs_tblid;
>>        const u32 id;
>>        const char *name;
>>        const struct module *owner;
>>        const struct layoutdriver_operations *ld_ops;
>>  };
>>
>> Benny
>>
> 
> I'll point out that what I took from the above conversation, was that
> the fields id, name, and possibly owner should be placed inside struct
> layoutdriver_operations (which should probably be renamed slightly).
> 
> 

just keep it simple please :)

Benny

>>>
>>> Cheers
>>>   Trond
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/13] RFC: pnfs: full mount/umount infrastructure
  2010-09-13 11:24       ` Benny Halevy
  2010-09-13 12:29         ` Trond Myklebust
@ 2010-09-13 14:28         ` Christoph Hellwig
  2010-09-13 14:39           ` Benny Halevy
  1 sibling, 1 reply; 55+ messages in thread
From: Christoph Hellwig @ 2010-09-13 14:28 UTC (permalink / raw)
  To: Benny Halevy; +Cc: Trond Myklebust, Christoph Hellwig, Fred Isaman, linux-nfs

On Mon, Sep 13, 2010 at 01:24:51PM +0200, Benny Halevy wrote:
> However, on the longer run I'd like us to consider formalizing
> the kABI for non-GPLed layout drivers.

No.  Non-GPLed drivers will have a very hard way to stand against the
derived work clauses for some specifically written new code anyway, and
if you haven't noticed yet there's no kABI in mainline anyway.  

> I think that this is a great selling point as it fully materializes the
> extensibility of the layout-type / layout-driver design model.

Drinking again?


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/13] RFC: pnfs: full mount/umount infrastructure
  2010-09-13 12:29         ` Trond Myklebust
@ 2010-09-13 14:37           ` Benny Halevy
  2010-09-13 16:55             ` Trond Myklebust
  0 siblings, 1 reply; 55+ messages in thread
From: Benny Halevy @ 2010-09-13 14:37 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Christoph Hellwig, Fred Isaman, linux-nfs

On 2010-09-13 14:29, Trond Myklebust wrote:
> On Mon, 2010-09-13 at 13:24 +0200, Benny Halevy wrote:
>> On 2010-09-11 03:07, Trond Myklebust wrote:
>>> On Fri, 2010-09-10 at 19:58 -0400, Christoph Hellwig wrote:
>>>>> +EXPORT_SYMBOL(pnfs_register_layoutdriver);
>>>>
>>>> All exports from nfs.ko need to be _GPL - this is in no way a public
>>>> API, just an internal subdivision of the nfs client.
>>>
>>> ACK. We're not committing to supporting a stable ABI here in any way,
>>> shape or form...
>>
>> OK, not yet.
>>
>> However, on the longer run I'd like us to consider formalizing
>> the kABI for non-GPLed layout drivers.
>>
>> I think that this is a great selling point as it fully materializes the
>> extensibility of the layout-type / layout-driver design model.
> 
> No. I'm not committing to a kabi, ever...

Care to explain your reasoning in some more details?

> 
> Trond
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/13] RFC: pnfs: full mount/umount infrastructure
  2010-09-13 14:28         ` Christoph Hellwig
@ 2010-09-13 14:39           ` Benny Halevy
  0 siblings, 0 replies; 55+ messages in thread
From: Benny Halevy @ 2010-09-13 14:39 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Trond Myklebust, Fred Isaman, linux-nfs

On 2010-09-13 16:28, Christoph Hellwig wrote:
> On Mon, Sep 13, 2010 at 01:24:51PM +0200, Benny Halevy wrote:
>> However, on the longer run I'd like us to consider formalizing
>> the kABI for non-GPLed layout drivers.
> 
> No.  Non-GPLed drivers will have a very hard way to stand against the
> derived work clauses for some specifically written new code anyway, and
> if you haven't noticed yet there's no kABI in mainline anyway.  
> 
>> I think that this is a great selling point as it fully materializes the
>> extensibility of the layout-type / layout-driver design model.
> 
> Drinking again?
> 

Heh, just water :-)

I guess I'll have to agree to disagree.

Benny

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/13] RFC: pnfs: full mount/umount infrastructure
  2010-09-13 11:06     ` Boaz Harrosh
@ 2010-09-13 14:44       ` Christoph Hellwig
  2010-09-13 15:14         ` Boaz Harrosh
  0 siblings, 1 reply; 55+ messages in thread
From: Christoph Hellwig @ 2010-09-13 14:44 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: Trond Myklebust, Fred Isaman, linux-nfs

On Mon, Sep 13, 2010 at 01:06:53PM +0200, Boaz Harrosh wrote:
> Me too BTW.
> I will later need per-super-block resources, mainly slabs and work_q.

Slab caches are always globals, as should workqueues be.  With Tejun's
recent work most work queues should go away completely.

> I think we should just pass nfss to the LD and let it take care of accessing
> nfss->nfs_client for the device cache. (It is done so later in the tree, right?)

You can add this later once a layout driver actually needing it is
introduced.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/13] RFC: pnfs: filelayout: introduce minimal file layout driver
  2010-09-10 22:37       ` Trond Myklebust
  2010-09-13 10:32         ` Benny Halevy
@ 2010-09-13 14:48         ` Christoph Hellwig
  1 sibling, 0 replies; 55+ messages in thread
From: Christoph Hellwig @ 2010-09-13 14:48 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Fred Isaman, linux-nfs

On Fri, Sep 10, 2010 at 06:37:03PM -0400, Trond Myklebust wrote:
>       struct module *owner = THIS_MODULE;
> 
> > }
> 
> ...although the struct module could probably indeed be part of
> pnfs_layoutdriver_type too.

If we only ever have one instance of the ops per type that's absolutely
the way to go.  const ops structures are nice to have, but if it means
making the code more complex just for that there's no point.

Take a look at e.g. struct file_system_type.
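
A minimal userspace sketch of the single-structure layout suggested here,
modelled loosely on how struct file_system_type carries its callbacks
directly. All names below (struct layout_driver, driver_mount, the owner
string, the id value) are illustrative assumptions, not the definitions
from the patch set:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a merged pnfs_layoutdriver_type: identity,
 * owner and callbacks live in one structure, with no separate ops vector.
 */
struct layout_driver {
	struct layout_driver *next;	/* list linkage, set by the generic layer */
	uint32_t id;
	const char *name;
	const char *owner;		/* stands in for struct module *owner */
	int (*initialize_mountpoint)(void *server);
	int (*uninitialize_mountpoint)(void *server);
};

static int mounts;	/* counts active mounts, for illustration only */

static int files_initialize_mountpoint(void *server)
{
	(void)server;
	mounts++;
	return 0;
}

static int files_uninitialize_mountpoint(void *server)
{
	(void)server;
	mounts--;
	return 0;
}

/* Not const: the generic layer writes the list linkage into it. */
static struct layout_driver files_driver = {
	.id = 1,			/* assumed value for the files layout type */
	.name = "LAYOUT_NFSV4_1_FILES",
	.owner = "files-layout-module",	/* hypothetical owner tag */
	.initialize_mountpoint = files_initialize_mountpoint,
	.uninitialize_mountpoint = files_uninitialize_mountpoint,
};

/* Generic-layer helper: call the mount hook if the driver provides one. */
int driver_mount(struct layout_driver *drv, void *server)
{
	assert(drv != NULL);
	if (drv->initialize_mountpoint == NULL)
		return 0;
	return drv->initialize_mountpoint(server);
}

/* Generic-layer helper: call the umount hook if the driver provides one. */
int driver_umount(struct layout_driver *drv, void *server)
{
	assert(drv != NULL);
	if (drv->uninitialize_mountpoint == NULL)
		return 0;
	return drv->uninitialize_mountpoint(server);
}
```

With the callbacks embedded, the generic layer needs only one pointer per
driver. The cost is that the structure can no longer be const, because the
list linkage lives in it.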


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/13] RFC: pnfs: full mount/umount infrastructure
  2010-09-02 18:00 ` [PATCH 07/13] RFC: pnfs: full mount/umount infrastructure Fred Isaman
  2010-09-10 19:23   ` Trond Myklebust
  2010-09-10 23:58   ` Christoph Hellwig
@ 2010-09-13 15:07   ` Christoph Hellwig
  2010-09-13 15:27     ` Fred Isaman
  2 siblings, 1 reply; 55+ messages in thread
From: Christoph Hellwig @ 2010-09-13 15:07 UTC (permalink / raw)
  To: Fred Isaman; +Cc: linux-nfs

On Thu, Sep 02, 2010 at 02:00:13PM -0400, Fred Isaman wrote:
>  static struct pnfs_layoutdriver_type *
>  find_pnfs_driver(u32 id) {
> -	return NULL;
> +	struct  pnfs_layoutdriver_type *local;
> +
> +	spin_lock(&pnfs_spinlock);
> +	local = find_pnfs_driver_locked(id);
> +	spin_unlock(&pnfs_spinlock);
> +	return local;

What about refcounting the structure?

> +	spin_lock(&pnfs_spinlock);
> +	if (!find_pnfs_driver_locked(ld_type->id)) {
> +		list_add(&ld_type->pnfs_tblid, &pnfs_modules_tbl);
> +		status = 0;
> +		dprintk("%s Registering id:%u name:%s\n", __func__, ld_type->id,
> +			ld_type->name);
> +	} else
> +		printk(KERN_ERR "%s Module with id %d already loaded!\n",
> +			__func__, ld_type->id);
> +	spin_unlock(&pnfs_spinlock);

In other places we generally don't bother doing these checks, so for
pnfs with just three clients it's even more pointless.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/13] RFC: pnfs: filelayout: introduce minimal file layout driver
  2010-09-02 18:00 ` [PATCH 08/13] RFC: pnfs: filelayout: introduce minimal file layout driver Fred Isaman
  2010-09-10 19:31   ` Trond Myklebust
@ 2010-09-13 15:08   ` Christoph Hellwig
  2010-09-13 15:16     ` Fred Isaman
  1 sibling, 1 reply; 55+ messages in thread
From: Christoph Hellwig @ 2010-09-13 15:08 UTC (permalink / raw)
  To: Fred Isaman; +Cc: linux-nfs

On Thu, Sep 02, 2010 at 02:00:14PM -0400, Fred Isaman wrote:
> +static int __init nfs4filelayout_init(void)
> +{
> +	printk(KERN_INFO "%s: NFSv4 File Layout Driver Registering...\n",
> +	       __func__);
> +
> +	/*
> +	 * Need to register file_operations struct with global list to indicate
> +	 * that NFS4 file layout is a possible pNFS I/O module
> +	 */
> +	return pnfs_register_layoutdriver(&filelayout_type);

And I thought we were dealing with an UFO detection device here..

Sarcasm aside, this is one of the most pointless and incorrect comments I've
seen for a while.

> +	/* Unregister NFS4 file layout driver with pNFS client*/
> +	pnfs_unregister_layoutdriver(&filelayout_type);

This one is not too useful either.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/13] RFC: pnfs: full mount/umount infrastructure
  2010-09-13 14:44       ` Christoph Hellwig
@ 2010-09-13 15:14         ` Boaz Harrosh
  0 siblings, 0 replies; 55+ messages in thread
From: Boaz Harrosh @ 2010-09-13 15:14 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Trond Myklebust, Fred Isaman, linux-nfs

On 09/13/2010 04:44 PM, Christoph Hellwig wrote:
> On Mon, Sep 13, 2010 at 01:06:53PM +0200, Boaz Harrosh wrote:
>> Me too BTW.
>> I will later need per-super-block resources, mainly slabs and work_q.
> 
> Slab caches are always globals, as should workqueues be.  With Tejun's
> recent work most work queues should go away completely.
> 

Right, sorry wrong terminology. I have pages pre-allocation for raid calculation
I always keep a minimum of full stripe raid pages per SB and scale up to the amount
of concurrent writes, with some lazy deallocation. It's all very preliminary.

The thread is so I can issue async writes/reads from io completion. So you
are right this should be global now.

>> I think we should just pass nfss to the LD and let it take care of accessing
>> nfss->nfs_client for the device cache. (It is done so later in the tree, right?)
> 
> You can add this later once a layout driver actually needing it is
> introduced.
> 

You are right in principle, but this is one place that should be considered
from a higher-level view. What happens is that we can only make the call
after an actual mount, when we know the version/pnfs-usage/layout-type to be
used. Then when we know that, we have a per-nfs_client initialization to do.
Later in both blocks and objects we have other things to do peculiar to
each. The one call vector could serve them all. And the overall name of
that vector is therefore init_mountpoint(). The original code was passing
nfs-server as proper for its name. Later attempts to hide the code future
and comply with files-only needs changed that to be an nfs-client pointer
since that was the only thing used. Now the name is wrong and the place
it is called from is confusing.

I think it is simpler and clearer to just call it what it is and
pass the proper pointer. For the binary code it is all the same. For
code readability and review, an attempt to clarify just made it more
complicated. So why not just drop it?

Boaz

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/13] RFC: pnfs: filelayout: introduce minimal file layout driver
  2010-09-13 15:08   ` Christoph Hellwig
@ 2010-09-13 15:16     ` Fred Isaman
  0 siblings, 0 replies; 55+ messages in thread
From: Fred Isaman @ 2010-09-13 15:16 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-nfs

On Mon, Sep 13, 2010 at 8:08 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Thu, Sep 02, 2010 at 02:00:14PM -0400, Fred Isaman wrote:
>> +static int __init nfs4filelayout_init(void)
>> +{
>> +     printk(KERN_INFO "%s: NFSv4 File Layout Driver Registering...\n",
>> +            __func__);
>> +
>> +     /*
>> +      * Need to register file_operations struct with global list to indicate
>> +      * that NFS4 file layout is a possible pNFS I/O module
>> +      */
>> +     return pnfs_register_layoutdriver(&filelayout_type);
>
> And I thought we were dealing with an UFO detection device here..
>
> Sarcasm aside, this is one of the most pointless and incorrect comments I've
> seen for a while.
>
>> +     /* Unregister NFS4 file layout driver with pNFS client*/
>> +     pnfs_unregister_layoutdriver(&filelayout_type);
>
> This one is not too useful either.
>

OK

Fred

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/13] RFC: pnfs: full mount/umount infrastructure
  2010-09-13 15:07   ` Christoph Hellwig
@ 2010-09-13 15:27     ` Fred Isaman
  0 siblings, 0 replies; 55+ messages in thread
From: Fred Isaman @ 2010-09-13 15:27 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-nfs

On Mon, Sep 13, 2010 at 8:07 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Thu, Sep 02, 2010 at 02:00:13PM -0400, Fred Isaman wrote:
>>  static struct pnfs_layoutdriver_type *
>>  find_pnfs_driver(u32 id) {
>> -     return NULL;
>> +     struct  pnfs_layoutdriver_type *local;
>> +
>> +     spin_lock(&pnfs_spinlock);
>> +     local = find_pnfs_driver_locked(id);
>> +     spin_unlock(&pnfs_spinlock);
>> +     return local;
>
> What about refcounting the structure?
>

The structure is pretty tightly tied to the lifetime of the driver
module.  It seems like grabbing a reference on the module (which as
Trond pointed out needs to be done) is enough.


>> +     spin_lock(&pnfs_spinlock);
>> +     if (!find_pnfs_driver_locked(ld_type->id)) {
>> +             list_add(&ld_type->pnfs_tblid, &pnfs_modules_tbl);
>> +             status = 0;
>> +             dprintk("%s Registering id:%u name:%s\n", __func__, ld_type->id,
>> +                     ld_type->name);
>> +     } else
>> +             printk(KERN_ERR "%s Module with id %d already loaded!\n",
>> +                     __func__, ld_type->id);
>> +     spin_unlock(&pnfs_spinlock);
>
> In other places we generally don't bother doing these checks, so for
> pnfs with just three clients it's even more pointless.
>

But they certainly don't hurt.  If it would prevent inclusion in
mainline, I'll take out the check, but intentionally introducing
potential bugs just because it's done in other places seems like a poor
argument.

Fred
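
For illustration, the two points above (rejecting a duplicate id at
registration, and pinning the driver by taking a reference during lookup)
can be sketched in userspace C roughly as follows. This is not the patch
code: the names are hypothetical, a pthread mutex stands in for
pnfs_spinlock, and a plain counter stands in for the module reference.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a registered layout driver. */
struct layout_driver {
	struct layout_driver *next;
	uint32_t id;
	const char *name;
	int refs;	/* stands in for the module reference count */
};

static struct layout_driver *drivers;
static pthread_mutex_t drivers_lock = PTHREAD_MUTEX_INITIALIZER;

/* Caller must hold drivers_lock. */
static struct layout_driver *find_driver_locked(uint32_t id)
{
	struct layout_driver *d;

	for (d = drivers; d != NULL; d = d->next)
		if (d->id == id)
			return d;
	return NULL;
}

/* Returns 0 on success, -1 if a driver with the same id is registered. */
int register_driver(struct layout_driver *drv)
{
	int status = -1;

	assert(drv != NULL);
	pthread_mutex_lock(&drivers_lock);
	if (find_driver_locked(drv->id) == NULL) {
		drv->next = drivers;
		drivers = drv;
		status = 0;
	}
	pthread_mutex_unlock(&drivers_lock);
	return status;
}

/*
 * Look up a driver and take a reference while the lock is still held,
 * so the driver cannot go away between lookup and use.
 */
struct layout_driver *get_driver(uint32_t id)
{
	struct layout_driver *d;

	pthread_mutex_lock(&drivers_lock);
	d = find_driver_locked(id);
	if (d != NULL)
		d->refs++;
	pthread_mutex_unlock(&drivers_lock);
	return d;
}

/* Drop a reference taken by get_driver(). */
void put_driver(struct layout_driver *d)
{
	assert(d != NULL);
	pthread_mutex_lock(&drivers_lock);
	assert(d->refs > 0);
	d->refs--;
	pthread_mutex_unlock(&drivers_lock);
}
```

The detail that matters in this sketch is that the reference is taken
before the lock is dropped; a lookup that returns a bare pointer after
unlocking leaves a window in which the driver could be unregistered.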

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/13] RFC: pnfs: full mount/umount infrastructure
  2010-09-13 14:37           ` Benny Halevy
@ 2010-09-13 16:55             ` Trond Myklebust
  0 siblings, 0 replies; 55+ messages in thread
From: Trond Myklebust @ 2010-09-13 16:55 UTC (permalink / raw)
  To: Benny Halevy; +Cc: Christoph Hellwig, Fred Isaman, linux-nfs

On Mon, 2010-09-13 at 16:37 +0200, Benny Halevy wrote:
> On 2010-09-13 14:29, Trond Myklebust wrote:
> > On Mon, 2010-09-13 at 13:24 +0200, Benny Halevy wrote:
> >> On 2010-09-11 03:07, Trond Myklebust wrote:
> >>> On Fri, 2010-09-10 at 19:58 -0400, Christoph Hellwig wrote:
> >>>>> +EXPORT_SYMBOL(pnfs_register_layoutdriver);
> >>>>
> >>>> All exports from nfs.ko need to be _GPL - this is in no way a public
> >>>> API, just an internal subdivision of the nfs client.
> >>>
> >>> ACK. We're not committing to supporting a stable ABI here in any way,
> >>> shape or form...
> >>
> >> OK, not yet.
> >>
> >> However, on the longer run I'd like us to consider formalizing
> >> the kABI for non-GPLed layout drivers.
> >>
> >> I think that this is a great selling point as it fully materializes the
> >> extensibility of the layout-type / layout-driver design model.
> > 
> > No. I'm not committing to a kabi, ever...
> 
> Care to explain your reasoning in some more details?

It has already been laid out in Documentation/stable_api_nonsense.txt.
It boils down to the following:

 1) I don't _ever_ want to have to deal with debugging problems that
might involve binary-only modules.
 2) I want to be able to change interfaces in order to optimise/improve
the code as needed without having to check in with a bunch of
out-of-tree maintainers about whether or not it is 'OK' to do so.
 3) I don't see a need for linking to non-GPLed code.

Trond

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2010-09-13 16:56 UTC | newest]

Thread overview: 55+ messages
2010-09-02 18:00 [PATCH 00/13] RFC: pnfs: LAYOUTGET/DEVINFO submission Fred Isaman
2010-09-02 18:00 ` [PATCH 01/13] NFSD: remove duplicate NFS4_STATEID_SIZE Fred Isaman
2010-09-02 18:00 ` [PATCH 02/13] SUNRPC: define xdr_decode_opaque_fixed Fred Isaman
2010-09-02 18:00 ` [PATCH 03/13] RFC: pnfsd, pnfs: protocol level pnfs constants Fred Isaman
2010-09-02 18:00 ` [PATCH 04/13] RFC: nfs: change stateid to be a union Fred Isaman
2010-09-02 18:00 ` [PATCH 05/13] RFC: nfs: ask for layouttypes during fsinfo call Fred Isaman
2010-09-02 18:00 ` [PATCH 06/13] RFC: nfs: set layout driver Fred Isaman
2010-09-02 18:00 ` [PATCH 07/13] RFC: pnfs: full mount/umount infrastructure Fred Isaman
2010-09-10 19:23   ` Trond Myklebust
     [not found]     ` <1284146604.10062.68.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2010-09-10 20:53       ` Fred Isaman
2010-09-13 11:06     ` Boaz Harrosh
2010-09-13 14:44       ` Christoph Hellwig
2010-09-13 15:14         ` Boaz Harrosh
2010-09-13 11:20     ` Benny Halevy
2010-09-10 23:58   ` Christoph Hellwig
2010-09-11  0:07     ` Trond Myklebust
2010-09-13 11:24       ` Benny Halevy
2010-09-13 12:29         ` Trond Myklebust
2010-09-13 14:37           ` Benny Halevy
2010-09-13 16:55             ` Trond Myklebust
2010-09-13 14:28         ` Christoph Hellwig
2010-09-13 14:39           ` Benny Halevy
2010-09-13 15:07   ` Christoph Hellwig
2010-09-13 15:27     ` Fred Isaman
2010-09-02 18:00 ` [PATCH 08/13] RFC: pnfs: filelayout: introduce minimal file layout driver Fred Isaman
2010-09-10 19:31   ` Trond Myklebust
2010-09-10 21:11     ` Fred Isaman
2010-09-10 22:37       ` Trond Myklebust
2010-09-13 10:32         ` Benny Halevy
2010-09-13 13:01           ` Fred Isaman
     [not found]             ` <AANLkTimONZfA6ZX4xtzbmy0NdfEtbwMAi+__PhFYznTn-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-09-13 14:23               ` Benny Halevy
2010-09-13 14:48         ` Christoph Hellwig
2010-09-13 10:16       ` Benny Halevy
2010-09-10 23:56     ` Christoph Hellwig
2010-09-11  0:03       ` Trond Myklebust
2010-09-11  0:07         ` Christoph Hellwig
2010-09-11  0:13           ` Trond Myklebust
2010-09-13 11:28             ` Benny Halevy
2010-09-13 15:08   ` Christoph Hellwig
2010-09-13 15:16     ` Fred Isaman
2010-09-02 18:00 ` [PATCH 09/13] RFC: nfs: create and destroy inode's layout cache Fred Isaman
2010-09-10 19:43   ` Trond Myklebust
     [not found]     ` <1284147785.10062.80.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2010-09-10 21:13       ` Fred Isaman
2010-09-13 11:32     ` Benny Halevy
2010-09-02 18:00 ` [PATCH 10/13] RFC: nfs: client needs to maintain list of inodes with active layouts Fred Isaman
2010-09-10 19:59   ` Trond Myklebust
     [not found]     ` <1284148768.10062.94.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2010-09-10 21:18       ` Fred Isaman
2010-09-02 18:00 ` [PATCH 11/13] RFC: nfs: retry on certain pnfs errors Fred Isaman
2010-09-02 18:00 ` [PATCH 12/13] RFC: pnfs: add LAYOUTGET and GETDEVICEINFO infrastructure Fred Isaman
2010-09-10 20:11   ` Trond Myklebust
2010-09-10 21:47     ` Fred Isaman
2010-09-10 22:43       ` Trond Myklebust
2010-09-13 14:16       ` Benny Halevy
2010-09-02 18:00 ` [PATCH 13/13] RFC: pnfs: filelayout: add driver's " Fred Isaman
2010-09-10 20:33   ` Trond Myklebust
