linux-nfs.vger.kernel.org archive mirror
* [PATCH 00/34] pnfs block layout driver based on v3.0-rc2
@ 2011-06-12 23:43 Jim Rees
  2011-06-12 23:43 ` [PATCH 01/34] pnfs: GETDEVICELIST Jim Rees
                   ` (33 more replies)
  0 siblings, 34 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:43 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

For review only.  This patch set adds the pnfs block layout driver to the
nfs client.  It is roughly equivalent to the 88-patch set I sent earlier,
but is re-organized and based on v3.0-rc2.

There are several style problems, mostly white space and line
length problems, and I'll deal with those.  I'm interested in comments on
the structure of the patch set, and of course the logic and functionality.

Date: Sun, 12 Jun 2011 19:43:45 -0400

The first 6 patches are generic-layer changes that the block layout driver
needs.

Patches 7-32 are the block driver code.

Patch 33 is the layout prefetch patch, which Benny has agreed to take for BAT
with the default value set to 0.

Patch 34 is the fix for layoutcommit vs. update_inode. Trond hasn't merged
it, so I include it here because the block driver needs it to pass the cthon
hole file test.

Although the code is almost the same as the previous set, we have not had
much time to test it fully. Increasing the number of segments (by reducing
the layout segment size) sometimes seems to hang the client in the cthon
test. There may be bugs in either the generic-layer segment management or
the block driver extent management (we have already fixed some bugs in both
parts). We will investigate further tomorrow.

Andy Adamson (1):
  pnfs: GETDEVICELIST

Benny Halevy (1):
  pnfs: add set-clear layoutdriver interface

Fred (1):
  pnfsblock: find_get_extent

Fred Isaman (21):
  pnfsblock: define PNFS_BLOCK Kconfig option
  pnfsblock: blocklayout stub
  pnfsblock: layout alloc and free
  Add support for simple rpc pipefs
  pnfsblock: basic extent code
  pnfsblock: lseg alloc and free
  pnfsblock: merge extents
  pnfsblock: call and parse getdevicelist
  pnfsblock: allow use of PG_owner_priv_1 flag
  pnfsblock: xdr decode pnfs_block_layout4
  pnfsblock: SPLITME: add extent manipulation functions
  pnfsblock: merge rw extents
  pnfsblock: encode_layoutcommit
  pnfsblock: cleanup_layoutcommit
  pnfsblock: bl_read_pagelist
  pnfsblock: write_begin
  pnfsblock: write_end
  pnfsblock: write_end_cleanup
  pnfsblock: bl_write_pagelist support functions
  pnfsblock: bl_write_pagelist
  pnfsblock: note written INVAL areas for layoutcommit

Jim Rees (3):
  pnfs-block: Add block device discovery pipe
  pnfsblock: add device operations
  pnfsblock: remove device operations

Peng Tao (6):
  pnfs: let layoutcommit code handle multiple segments
  pnfs: hook nfs_write_begin/end to allow layout driver manipulation
  pnfs: ask for layout_blksize and save it in nfs_server
  pnfs: cleanup_layoutcommit
  Add configurable prefetch size for layoutget
  NFS41: do not update isize if inode needs layoutcommit

Zhang Jingwang (1):
  pnfsblock: Implement release_inval_marks

 fs/nfs/Kconfig                                   |    8 +
 fs/nfs/Makefile                                  |    1 +
 fs/nfs/blocklayout/Makefile                      |    5 +
 fs/nfs/blocklayout/block-device-discovery-pipe.c |   66 ++
 fs/nfs/blocklayout/blocklayout.c                 | 1089 ++++++++++++++++++++++
 fs/nfs/blocklayout/blocklayout.h                 |  287 ++++++
 fs/nfs/blocklayout/blocklayoutdev.c              |  346 +++++++
 fs/nfs/blocklayout/blocklayoutdm.c               |  120 +++
 fs/nfs/blocklayout/extents.c                     |  941 +++++++++++++++++++
 fs/nfs/client.c                                  |    8 +-
 fs/nfs/file.c                                    |   26 +-
 fs/nfs/inode.c                                   |    3 +-
 fs/nfs/nfs4_fs.h                                 |    2 +-
 fs/nfs/nfs4proc.c                                |   53 +-
 fs/nfs/nfs4xdr.c                                 |  232 +++++-
 fs/nfs/pnfs.c                                    |  105 ++-
 fs/nfs/pnfs.h                                    |  141 +++-
 fs/nfs/sysctl.c                                  |   10 +
 fs/nfs/write.c                                   |   12 +-
 include/linux/nfs4.h                             |    1 +
 include/linux/nfs_fs.h                           |    3 +-
 include/linux/nfs_fs_sb.h                        |    4 +-
 include/linux/nfs_xdr.h                          |   15 +-
 include/linux/sunrpc/simple_rpc_pipefs.h         |  105 +++
 net/sunrpc/simple_rpc_pipefs.c                   |  423 +++++++++
 25 files changed, 3962 insertions(+), 44 deletions(-)
 create mode 100644 fs/nfs/blocklayout/Makefile
 create mode 100644 fs/nfs/blocklayout/block-device-discovery-pipe.c
 create mode 100644 fs/nfs/blocklayout/blocklayout.c
 create mode 100644 fs/nfs/blocklayout/blocklayout.h
 create mode 100644 fs/nfs/blocklayout/blocklayoutdev.c
 create mode 100644 fs/nfs/blocklayout/blocklayoutdm.c
 create mode 100644 fs/nfs/blocklayout/extents.c
 create mode 100644 include/linux/sunrpc/simple_rpc_pipefs.h
 create mode 100644 net/sunrpc/simple_rpc_pipefs.c

-- 
1.7.4.1



* [PATCH 01/34] pnfs: GETDEVICELIST
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
@ 2011-06-12 23:43 ` Jim Rees
  2011-06-12 23:43 ` [PATCH 02/34] pnfs: add set-clear layoutdriver interface Jim Rees
                   ` (32 subsequent siblings)
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:43 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Andy Adamson <andros@netapp.com>

The block driver uses GETDEVICELIST

Signed-off-by: Andy Adamson <andros@netapp.com>
[pass struct nfs_server * to getdevicelist]
[get machine creds for getdevicelist]
[fix getdevicelist decode sizing]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Peng Tao <bergwolf@gmail.com>
---
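
For readers who want to see the reply layout outside the kernel, here is a
minimal user-space sketch of the decode path below. It is not the kernel
code: decode_devicelist(), get_be32() and struct devicelist are hypothetical
names, DEVICEID_SIZE assumes NFS4_DEVICEID4_SIZE is 16, and the buffer is
assumed to start right after the op header. Like the patch, it skips the
cookie and verifier and rejects a count above the 16-entry limit.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEVLIST_MAXNUM 16	/* mirrors NFS4_PNFS_GETDEVLIST_MAXNUM */
#define DEVICEID_SIZE  16	/* assumed value of NFS4_DEVICEID4_SIZE */

/* Hypothetical user-space stand-in for struct pnfs_devicelist. */
struct devicelist {
	uint32_t eof;
	uint32_t num_devs;
	unsigned char dev_id[DEVLIST_MAXNUM][DEVICEID_SIZE];
};

/* Read one big-endian 32-bit XDR word. */
static uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Decode cookie (skipped), verifier (skipped), device-ID count, the
 * device IDs and the eof flag.  Returns 0 on success, -1 if the buffer
 * is too short or the count exceeds DEVLIST_MAXNUM.
 */
static int decode_devicelist(const unsigned char *buf, size_t len,
			     struct devicelist *res)
{
	size_t off = 8 + 8;	/* cookie + verifier */
	uint32_t i;

	if (len < off + 4)
		return -1;
	res->num_devs = get_be32(buf + off);
	off += 4;
	if (res->num_devs > DEVLIST_MAXNUM)
		return -1;
	if (len < off + (size_t)res->num_devs * DEVICEID_SIZE + 4)
		return -1;
	for (i = 0; i < res->num_devs; i++) {
		memcpy(res->dev_id[i], buf + off, DEVICEID_SIZE);
		off += DEVICEID_SIZE;
	}
	res->eof = get_be32(buf + off);
	return 0;
}
```

A caller would pass the raw reply bytes and then check eof; as the TODO in
the patch notes, the eof == false case (more devices than fit in one reply)
is not handled yet.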
 fs/nfs/nfs4proc.c       |   47 +++++++++++++++++
 fs/nfs/nfs4xdr.c        |  128 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/nfs/pnfs.h           |   12 ++++
 include/linux/nfs4.h    |    1 +
 include/linux/nfs_xdr.h |   11 ++++
 5 files changed, 199 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index d2c4b59..4a5ad93 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -5763,6 +5763,53 @@ int nfs4_proc_layoutreturn(struct nfs4_layoutreturn *lrp)
 	return status;
 }
 
+/*
+ * Retrieve the list of Data Server devices from the MDS.
+ */
+static int _nfs4_getdevicelist(struct nfs_server *server,
+				    const struct nfs_fh *fh,
+				    struct pnfs_devicelist *devlist)
+{
+	struct nfs4_getdevicelist_args args = {
+		.fh = fh,
+		.layoutclass = server->pnfs_curr_ld->id,
+	};
+	struct nfs4_getdevicelist_res res = {
+		.devlist = devlist,
+	};
+	struct rpc_message msg = {
+		.rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_GETDEVICELIST],
+		.rpc_argp = &args,
+		.rpc_resp = &res,
+	};
+	int status;
+
+	dprintk("--> %s\n", __func__);
+	status = nfs4_call_sync(server->client, server, &msg, &args.seq_args, &res.seq_res, 0);
+	dprintk("<-- %s status=%d\n", __func__, status);
+	return status;
+}
+
+int nfs4_proc_getdevicelist(struct nfs_server *server,
+			    const struct nfs_fh *fh,
+			    struct pnfs_devicelist *devlist)
+{
+	struct nfs4_exception exception = { };
+	int err;
+
+	do {
+		err = nfs4_handle_exception(server,
+				_nfs4_getdevicelist(server, fh, devlist),
+				&exception);
+	} while (exception.retry);
+
+	dprintk("%s: err=%d, num_devs=%u\n", __func__,
+		err, devlist->num_devs);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(nfs4_proc_getdevicelist);
+
 static int
 _nfs4_proc_getdeviceinfo(struct nfs_server *server, struct pnfs_device *pdev)
 {
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index d869a5e..3620c45 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -314,6 +314,17 @@ static int nfs4_stat_to_errno(int);
 				XDR_QUADLEN(NFS4_MAX_SESSIONID_LEN) + 5)
 #define encode_reclaim_complete_maxsz	(op_encode_hdr_maxsz + 4)
 #define decode_reclaim_complete_maxsz	(op_decode_hdr_maxsz + 4)
+#define encode_getdevicelist_maxsz (op_encode_hdr_maxsz + 4 + \
+				encode_verifier_maxsz)
+#define decode_getdevicelist_maxsz (op_decode_hdr_maxsz + \
+				2 /* nfs_cookie4 gdlr_cookie */ + \
+				decode_verifier_maxsz \
+				  /* verifier4 gdlr_verifier */ + \
+				1 /* gdlr_deviceid_list count */ + \
+				XDR_QUADLEN(NFS4_PNFS_GETDEVLIST_MAXNUM * \
+					    NFS4_DEVICEID4_SIZE) \
+				  /* gdlr_deviceid_list */ + \
+				1 /* bool gdlr_eof */)
 #define encode_getdeviceinfo_maxsz (op_encode_hdr_maxsz + 4 + \
 				XDR_QUADLEN(NFS4_DEVICEID4_SIZE))
 #define decode_getdeviceinfo_maxsz (op_decode_hdr_maxsz + \
@@ -740,6 +751,14 @@ static int nfs4_stat_to_errno(int);
 #define NFS4_dec_reclaim_complete_sz	(compound_decode_hdr_maxsz + \
 					 decode_sequence_maxsz + \
 					 decode_reclaim_complete_maxsz)
+#define NFS4_enc_getdevicelist_sz (compound_encode_hdr_maxsz + \
+				encode_sequence_maxsz + \
+				encode_putfh_maxsz + \
+				encode_getdevicelist_maxsz)
+#define NFS4_dec_getdevicelist_sz (compound_decode_hdr_maxsz + \
+				decode_sequence_maxsz + \
+				decode_putfh_maxsz + \
+				decode_getdevicelist_maxsz)
 #define NFS4_enc_getdeviceinfo_sz (compound_encode_hdr_maxsz +    \
 				encode_sequence_maxsz +\
 				encode_getdeviceinfo_maxsz)
@@ -1827,6 +1846,26 @@ static void encode_sequence(struct xdr_stream *xdr,
 
 #ifdef CONFIG_NFS_V4_1
 static void
+encode_getdevicelist(struct xdr_stream *xdr,
+		     const struct nfs4_getdevicelist_args *args,
+		     struct compound_hdr *hdr)
+{
+	__be32 *p;
+	nfs4_verifier dummy = {
+		.data = "dummmmmy",
+	};
+
+	p = reserve_space(xdr, 20);
+	*p++ = cpu_to_be32(OP_GETDEVICELIST);
+	*p++ = cpu_to_be32(args->layoutclass);
+	*p++ = cpu_to_be32(NFS4_PNFS_GETDEVLIST_MAXNUM);
+	xdr_encode_hyper(p, 0ULL);                          /* cookie */
+	encode_nfs4_verifier(xdr, &dummy);
+	hdr->nops++;
+	hdr->replen += decode_getdevicelist_maxsz;
+}
+
+static void
 encode_getdeviceinfo(struct xdr_stream *xdr,
 		     const struct nfs4_getdeviceinfo_args *args,
 		     struct compound_hdr *hdr)
@@ -2707,6 +2746,24 @@ static void nfs4_xdr_enc_reclaim_complete(struct rpc_rqst *req,
 }
 
 /*
+ * Encode GETDEVICELIST request
+ */
+static void nfs4_xdr_enc_getdevicelist(struct rpc_rqst *req,
+				       struct xdr_stream *xdr,
+				       struct nfs4_getdevicelist_args *args)
+{
+	struct compound_hdr hdr = {
+		.minorversion = nfs4_xdr_minorversion(&args->seq_args),
+	};
+
+	encode_compound_hdr(xdr, req, &hdr);
+	encode_sequence(xdr, &args->seq_args, &hdr);
+	encode_putfh(xdr, args->fh, &hdr);
+	encode_getdevicelist(xdr, args, &hdr);
+	encode_nops(&hdr);
+}
+
+/*
  * Encode GETDEVICEINFO request
  */
 static void nfs4_xdr_enc_getdeviceinfo(struct rpc_rqst *req,
@@ -5139,6 +5196,50 @@ out_overflow:
 }
 
 #if defined(CONFIG_NFS_V4_1)
+/*
+ * TODO: Need to handle case when EOF != true;
+ */
+static int decode_getdevicelist(struct xdr_stream *xdr,
+				struct pnfs_devicelist *res)
+{
+	__be32 *p;
+	int status, i;
+	struct nfs_writeverf verftemp;
+
+	status = decode_op_hdr(xdr, OP_GETDEVICELIST);
+	if (status)
+		return status;
+
+	p = xdr_inline_decode(xdr, 8 + 8 + 4);
+	if (unlikely(!p))
+		goto out_overflow;
+
+	/* TODO: Skip cookie for now */
+	p += 2;
+
+	/* Read verifier */
+	p = xdr_decode_opaque_fixed(p, verftemp.verifier, 8);
+
+	res->num_devs = be32_to_cpup(p);
+
+	dprintk("%s: num_dev %d\n", __func__, res->num_devs);
+
+	if (res->num_devs > NFS4_PNFS_GETDEVLIST_MAXNUM)
+		return -NFS4ERR_REP_TOO_BIG;
+
+	p = xdr_inline_decode(xdr,
+			      res->num_devs * NFS4_DEVICEID4_SIZE + 4);
+	if (unlikely(!p))
+		goto out_overflow;
+	for (i = 0; i < res->num_devs; i++)
+		p = xdr_decode_opaque_fixed(p, res->dev_id[i].data,
+					    NFS4_DEVICEID4_SIZE);
+	res->eof = be32_to_cpup(p);
+	return 0;
+out_overflow:
+	print_overflow_msg(__func__, xdr);
+	return -EIO;
+}
 
 static int decode_getdeviceinfo(struct xdr_stream *xdr,
 				struct pnfs_device *pdev)
@@ -6364,6 +6465,32 @@ static int nfs4_xdr_dec_reclaim_complete(struct rpc_rqst *rqstp,
 }
 
 /*
+ * Decode GETDEVICELIST response
+ */
+static int nfs4_xdr_dec_getdevicelist(struct rpc_rqst *rqstp,
+				      struct xdr_stream *xdr,
+				      struct nfs4_getdevicelist_res *res)
+{
+	struct compound_hdr hdr;
+	int status;
+
+	dprintk("decoding getdevicelist!\n");
+
+	status = decode_compound_hdr(xdr, &hdr);
+	if (status != 0)
+		goto out;
+	status = decode_sequence(xdr, &res->seq_res, rqstp);
+	if (status != 0)
+		goto out;
+	status = decode_putfh(xdr);
+	if (status != 0)
+		goto out;
+	status = decode_getdevicelist(xdr, res->devlist);
+out:
+	return status;
+}
+
+/*
  * Decode GETDEVINFO response
  */
 static int nfs4_xdr_dec_getdeviceinfo(struct rpc_rqst *rqstp,
@@ -6657,6 +6784,7 @@ struct rpc_procinfo	nfs4_procedures[] = {
 	PROC(SEQUENCE,		enc_sequence,		dec_sequence),
 	PROC(GET_LEASE_TIME,	enc_get_lease_time,	dec_get_lease_time),
 	PROC(RECLAIM_COMPLETE,	enc_reclaim_complete,	dec_reclaim_complete),
+	PROC(GETDEVICELIST,	enc_getdevicelist,	dec_getdevicelist),
 	PROC(GETDEVICEINFO,	enc_getdeviceinfo,	dec_getdeviceinfo),
 	PROC(LAYOUTGET,		enc_layoutget,		dec_layoutget),
 	PROC(LAYOUTCOMMIT,	enc_layoutcommit,	dec_layoutcommit),
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 48d0a8e..6113fc6 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -132,14 +132,26 @@ struct pnfs_device {
 	unsigned int  layout_type;
 	unsigned int  mincount;
 	struct page **pages;
+	void          *area;
 	unsigned int  pgbase;
 	unsigned int  pglen;
 };
 
+#define NFS4_PNFS_GETDEVLIST_MAXNUM 16
+
+struct pnfs_devicelist {
+	unsigned int		eof;
+	unsigned int		num_devs;
+	struct nfs4_deviceid	dev_id[NFS4_PNFS_GETDEVLIST_MAXNUM];
+};
+
 extern int pnfs_register_layoutdriver(struct pnfs_layoutdriver_type *);
 extern void pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *);
 
 /* nfs4proc.c */
+extern int nfs4_proc_getdevicelist(struct nfs_server *server,
+				   const struct nfs_fh *fh,
+				   struct pnfs_devicelist *devlist);
 extern int nfs4_proc_getdeviceinfo(struct nfs_server *server,
 				   struct pnfs_device *dev);
 extern int nfs4_proc_layoutget(struct nfs4_layoutget *lgp);
diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
index 504b289..7915d41 100644
--- a/include/linux/nfs4.h
+++ b/include/linux/nfs4.h
@@ -560,6 +560,7 @@ enum {
 	NFSPROC4_CLNT_GET_LEASE_TIME,
 	NFSPROC4_CLNT_RECLAIM_COMPLETE,
 	NFSPROC4_CLNT_LAYOUTGET,
+	NFSPROC4_CLNT_GETDEVICELIST,
 	NFSPROC4_CLNT_GETDEVICEINFO,
 	NFSPROC4_CLNT_LAYOUTCOMMIT,
 	NFSPROC4_CLNT_LAYOUTRETURN,
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 5e8444a..00442f5 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -236,6 +236,17 @@ struct nfs4_layoutget {
 	gfp_t gfp_flags;
 };
 
+struct nfs4_getdevicelist_args {
+	const struct nfs_fh *fh;
+	u32 layoutclass;
+	struct nfs4_sequence_args seq_args;
+};
+
+struct nfs4_getdevicelist_res {
+	struct pnfs_devicelist *devlist;
+	struct nfs4_sequence_res seq_res;
+};
+
 struct nfs4_getdeviceinfo_args {
 	struct pnfs_device *pdev;
 	struct nfs4_sequence_args seq_args;
-- 
1.7.4.1



* [PATCH 02/34] pnfs: add set-clear layoutdriver interface
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
  2011-06-12 23:43 ` [PATCH 01/34] pnfs: GETDEVICELIST Jim Rees
@ 2011-06-12 23:43 ` Jim Rees
  2011-06-12 23:43 ` [PATCH 03/34] pnfs: let layoutcommit code handle multiple segments Jim Rees
                   ` (31 subsequent siblings)
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:43 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Benny Halevy <bhalevy@panasas.com>

Allow the layout driver to issue GETDEVICELIST at mount time and to clean
up at umount time.

[fixup non NFS_V4_1 set_pnfs_layoutdriver definition]
[pnfs: pass mntfh down the init_pnfs path]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
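
The optional-hook pattern this patch adds can be sketched in user space as
follows. This is only an illustration under stated assumptions: struct
layoutdriver, struct server, set_driver(), unset_driver() and the demo hooks
are hypothetical stand-ins, module refcounting is left out, and the mntfh
argument is dropped for brevity.

```c
#include <stddef.h>

struct server;

/* Hypothetical stand-in for struct pnfs_layoutdriver_type. */
struct layoutdriver {
	int (*set_layoutdriver)(struct server *srv);	/* optional */
	int (*clear_layoutdriver)(struct server *srv);	/* optional */
};

/* Hypothetical stand-in for struct nfs_server. */
struct server {
	const struct layoutdriver *curr_ld;
	int mounted_devices;	/* illustrative per-mount driver state */
};

/* Attach a driver, run its mount-time hook if any, back out on error. */
static int set_driver(struct server *srv, const struct layoutdriver *ld)
{
	srv->curr_ld = ld;
	if (ld->set_layoutdriver && ld->set_layoutdriver(srv)) {
		srv->curr_ld = NULL;
		return -1;
	}
	return 0;
}

/* Run the umount-time hook, if any, then detach the driver. */
static void unset_driver(struct server *srv)
{
	if (srv->curr_ld && srv->curr_ld->clear_layoutdriver)
		srv->curr_ld->clear_layoutdriver(srv);
	srv->curr_ld = NULL;
}

/* Example hooks for a block-like driver. */
static int demo_set(struct server *srv)
{
	srv->mounted_devices = 3;	/* pretend device discovery ran */
	return 0;
}

static int demo_clear(struct server *srv)
{
	srv->mounted_devices = 0;
	return 0;
}

static int failing_set(struct server *srv)
{
	(void)srv;
	return -1;
}
```

Both hooks are optional, so a driver that needs no per-mount setup (the
files layout, for example) can leave them NULL and is unaffected.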
 fs/nfs/client.c |    7 ++++---
 fs/nfs/pnfs.c   |   15 +++++++++++++--
 fs/nfs/pnfs.h   |    8 ++++++--
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index b3dc2b8..6bdb7da0 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -906,7 +906,8 @@ error:
 /*
  * Load up the server record from information gained in an fsinfo record
  */
-static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fsinfo *fsinfo)
+static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fh *mntfh,
+				  struct nfs_fsinfo *fsinfo)
 {
 	unsigned long max_rpc_payload;
 
@@ -936,7 +937,7 @@ static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fsinfo *
 	if (server->wsize > NFS_MAX_FILE_IO_SIZE)
 		server->wsize = NFS_MAX_FILE_IO_SIZE;
 	server->wpages = (server->wsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	set_pnfs_layoutdriver(server, fsinfo->layouttype);
+	set_pnfs_layoutdriver(server, mntfh, fsinfo->layouttype);
 
 	server->wtmult = nfs_block_bits(fsinfo->wtmult, NULL);
 
@@ -982,7 +983,7 @@ static int nfs_probe_fsinfo(struct nfs_server *server, struct nfs_fh *mntfh, str
 	if (error < 0)
 		goto out_error;
 
-	nfs_server_set_fsinfo(server, &fsinfo);
+	nfs_server_set_fsinfo(server, mntfh, &fsinfo);
 
 	/* Get some general file system info */
 	if (server->namelen == 0) {
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 8c1309d..e3d618b 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -75,8 +75,11 @@ find_pnfs_driver(u32 id)
 void
 unset_pnfs_layoutdriver(struct nfs_server *nfss)
 {
-	if (nfss->pnfs_curr_ld)
+	if (nfss->pnfs_curr_ld) {
+		if (nfss->pnfs_curr_ld->clear_layoutdriver)
+			nfss->pnfs_curr_ld->clear_layoutdriver(nfss);
 		module_put(nfss->pnfs_curr_ld->owner);
+	}
 	nfss->pnfs_curr_ld = NULL;
 }
 
@@ -87,7 +90,8 @@ unset_pnfs_layoutdriver(struct nfs_server *nfss)
  * @id layout type. Zero (illegal layout type) indicates pNFS not in use.
  */
 void
-set_pnfs_layoutdriver(struct nfs_server *server, u32 id)
+set_pnfs_layoutdriver(struct nfs_server *server, const struct nfs_fh *mntfh,
+		      u32 id)
 {
 	struct pnfs_layoutdriver_type *ld_type = NULL;
 
@@ -114,6 +118,13 @@ set_pnfs_layoutdriver(struct nfs_server *server, u32 id)
 		goto out_no_driver;
 	}
 	server->pnfs_curr_ld = ld_type;
+	if (ld_type->set_layoutdriver && ld_type->set_layoutdriver(server, mntfh)) {
+		printk(KERN_ERR
+		       "%s: Error initializing mount point for layout driver %u.\n",
+		       __func__, id);
+		module_put(ld_type->owner);
+		goto out_no_driver;
+	}
 
 	dprintk("%s: pNFS module for %u set\n", __func__, id);
 	return;
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 6113fc6..b071b56 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -80,6 +80,9 @@ struct pnfs_layoutdriver_type {
 	struct module *owner;
 	unsigned flags;
 
+	int (*set_layoutdriver) (struct nfs_server *, const struct nfs_fh *);
+	int (*clear_layoutdriver) (struct nfs_server *);
+
 	struct pnfs_layout_hdr * (*alloc_layout_hdr) (struct inode *inode, gfp_t gfp_flags);
 	void (*free_layout_hdr) (struct pnfs_layout_hdr *);
 
@@ -164,7 +167,7 @@ struct pnfs_layout_segment *
 pnfs_update_layout(struct inode *ino, struct nfs_open_context *ctx,
 		   loff_t pos, u64 count, enum pnfs_iomode access_type,
 		   gfp_t gfp_flags);
-void set_pnfs_layoutdriver(struct nfs_server *, u32 id);
+void set_pnfs_layoutdriver(struct nfs_server *, const struct nfs_fh *, u32);
 void unset_pnfs_layoutdriver(struct nfs_server *);
 enum pnfs_try_status pnfs_try_to_write_data(struct nfs_write_data *,
 					     const struct rpc_call_ops *, int);
@@ -388,7 +391,8 @@ pnfs_roc_drain(struct inode *ino, u32 *barrier)
 	return false;
 }
 
-static inline void set_pnfs_layoutdriver(struct nfs_server *s, u32 id)
+static inline void set_pnfs_layoutdriver(struct nfs_server *s,
+					 const struct nfs_fh *mntfh, u32 id)
 {
 }
 
-- 
1.7.4.1



* [PATCH 03/34] pnfs: let layoutcommit code handle multiple segments
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
  2011-06-12 23:43 ` [PATCH 01/34] pnfs: GETDEVICELIST Jim Rees
  2011-06-12 23:43 ` [PATCH 02/34] pnfs: add set-clear layoutdriver interface Jim Rees
@ 2011-06-12 23:43 ` Jim Rees
  2011-06-13 14:36   ` Fred Isaman
  2011-06-12 23:43 ` [PATCH 04/34] pnfs: hook nfs_write_begin/end to allow layout driver manipulation Jim Rees
                   ` (30 subsequent siblings)
  33 siblings, 1 reply; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:43 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Peng Tao <bergwolf@gmail.com>

Some layout drivers, such as block, will have multiple segments.
Generic code should be able to handle them.

Signed-off-by: Peng Tao <peng_tao@emc.com>
---
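
The selection logic in pnfs_list_write_lseg() below can be sketched in user
space like this. It is an illustration only: struct lseg, the needs_commit
flag (standing in for the NFS_LSEG_LAYOUTCOMMIT bit) and pick_commit_lseg()
are hypothetical names, and an array replaces the kernel list. The sketch
also guards against the case where no segment is flagged, which is an
addition of mine and not something the patch does.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum iomode { IOMODE_READ = 1, IOMODE_RW = 2 };

/* Hypothetical stand-in for struct pnfs_layout_segment. */
struct lseg {
	enum iomode iomode;
	bool needs_commit;	/* stands in for NFS_LSEG_LAYOUTCOMMIT */
	uint64_t end_pos;	/* stands in for pls_end_pos */
};

/*
 * Walk all segments, track the largest end position among RW segments,
 * clear the commit flag on each flagged RW segment, and return the last
 * flagged one with its end position raised to the overall maximum.
 * Returns NULL if no RW segment was flagged.
 */
static struct lseg *pick_commit_lseg(struct lseg *segs, size_t n)
{
	struct lseg *rv = NULL;
	uint64_t max_pos = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (segs[i].iomode != IOMODE_RW)
			continue;
		if (segs[i].end_pos > max_pos)
			max_pos = segs[i].end_pos;
		if (segs[i].needs_commit) {
			segs[i].needs_commit = false;
			rv = &segs[i];
		}
	}
	if (rv)		/* guard added in this sketch only */
		rv->end_pos = max_pos;
	return rv;
}
```

The point of carrying max_pos is that a single LAYOUTCOMMIT then covers the
last byte written through any RW segment, not just the flagged one.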
 fs/nfs/pnfs.c |   17 +++++++++++++----
 fs/nfs/pnfs.h |    1 +
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index e3d618b..f03a5e0 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -892,7 +892,7 @@ pnfs_find_lseg(struct pnfs_layout_hdr *lo,
 	dprintk("%s:Begin\n", __func__);
 
 	assert_spin_locked(&lo->plh_inode->i_lock);
-	list_for_each_entry(lseg, &lo->plh_segs, pls_list) {
+	list_for_each_entry_reverse(lseg, &lo->plh_segs, pls_list) {
 		if (test_bit(NFS_LSEG_VALID, &lseg->pls_flags) &&
 		    is_matching_lseg(&lseg->pls_range, range)) {
 			ret = get_lseg(lseg);
@@ -1193,10 +1193,18 @@ pnfs_try_to_read_data(struct nfs_read_data *rdata,
 static struct pnfs_layout_segment *pnfs_list_write_lseg(struct inode *inode)
 {
 	struct pnfs_layout_segment *lseg, *rv = NULL;
+	loff_t max_pos = 0;
+
+	list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list) {
+		if (lseg->pls_range.iomode == IOMODE_RW) {
+			if (max_pos < lseg->pls_end_pos)
+				max_pos = lseg->pls_end_pos;
+			if (test_and_clear_bit(NFS_LSEG_LAYOUTCOMMIT, &lseg->pls_flags))
+				rv = lseg;
+		}
+	}
+	rv->pls_end_pos = max_pos;
 
-	list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list)
-		if (lseg->pls_range.iomode == IOMODE_RW)
-			rv = lseg;
 	return rv;
 }
 
@@ -1211,6 +1219,7 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
 	if (!test_and_set_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) {
 		/* references matched in nfs4_layoutcommit_release */
 		get_lseg(wdata->lseg);
+		set_bit(NFS_LSEG_LAYOUTCOMMIT, &wdata->lseg->pls_flags);
 		wdata->lseg->pls_lc_cred =
 			get_rpccred(wdata->args.context->state->owner->so_cred);
 		mark_as_dirty = true;
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index b071b56..a3fc0f2 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -36,6 +36,7 @@
 enum {
 	NFS_LSEG_VALID = 0,	/* cleared when lseg is recalled/returned */
 	NFS_LSEG_ROC,		/* roc bit received from server */
+	NFS_LSEG_LAYOUTCOMMIT,	/* layoutcommit bit set for layoutcommit */
 };
 
 struct pnfs_layout_segment {
-- 
1.7.4.1



* [PATCH 04/34] pnfs: hook nfs_write_begin/end to allow layout driver manipulation
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (2 preceding siblings ...)
  2011-06-12 23:43 ` [PATCH 03/34] pnfs: let layoutcommit code handle multiple segments Jim Rees
@ 2011-06-12 23:43 ` Jim Rees
  2011-06-13 14:44   ` Fred Isaman
  2011-06-12 23:43 ` [PATCH 05/34] pnfs: ask for layout_blksize and save it in nfs_server Jim Rees
                   ` (29 subsequent siblings)
  33 siblings, 1 reply; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:43 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Peng Tao <bergwolf@gmail.com>

Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Reported-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Peng Tao <bergwolf@gmail.com>
---
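
Since this patch has no description, here is a rough user-space sketch of how
the fsdata cookie appears to be used by the hooks below. All names are
hypothetical stand-ins, calloc()/free() replace kzalloc()/kfree(), and lseg
refcounting is omitted. The cookie is the bare segment pointer when the
driver has no write_begin hook, and a wrapper structure carrying the segment
when it does.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct pnfs_layout_segment. */
struct lseg {
	int id;
};

/* Hypothetical stand-in for struct pnfs_fsdata. */
struct fsdata {
	struct lseg *lseg;
	void *private;
};

/* Hypothetical stand-in for the driver ops table. */
struct driver {
	int (*write_begin)(struct lseg *lseg, struct fsdata *data); /* optional */
};

/* Set up the cookie: bare lseg without a hook, wrapper with one. */
static int write_begin(const struct driver *ld, struct lseg *lseg,
		       void **cookie)
{
	struct fsdata *data;

	*cookie = lseg;
	if (!lseg || !ld->write_begin)
		return 0;
	data = calloc(1, sizeof(*data));
	if (!data) {
		*cookie = NULL;
		return -1;
	}
	data->lseg = lseg;
	if (ld->write_begin(lseg, data)) {
		free(data);
		*cookie = NULL;
		return -1;
	}
	*cookie = data;
	return 0;
}

/* Recover the segment from either form of cookie. */
static struct lseg *pull_lseg(const struct driver *ld, void *cookie)
{
	if (!cookie)
		return NULL;
	if (ld->write_begin)
		return ((struct fsdata *)cookie)->lseg;
	return cookie;
}

/* Free the wrapper, if one was allocated in write_begin(). */
static void write_end_cleanup(const struct driver *ld, void *cookie)
{
	if (cookie && ld->write_begin)
		free(cookie);
}

/* Example driver hook that stashes something in the private field. */
static int demo_begin(struct lseg *lseg, struct fsdata *data)
{
	data->private = lseg;
	return 0;
}
```

The cost of this design is that every consumer of the cookie has to ask the
same question (does the driver have write_begin?) to know which type it is
holding, which is what nfs4_pull_lseg_from_fsdata() does in the patch.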
 fs/nfs/file.c          |   26 ++++++++++-
 fs/nfs/pnfs.c          |   41 +++++++++++++++++
 fs/nfs/pnfs.h          |  115 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nfs/write.c         |   12 +++--
 include/linux/nfs_fs.h |    3 +-
 5 files changed, 189 insertions(+), 8 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 2f093ed..1768762 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -384,12 +384,15 @@ static int nfs_write_begin(struct file *file, struct address_space *mapping,
 	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
 	struct page *page;
 	int once_thru = 0;
+	struct pnfs_layout_segment *lseg;
 
 	dfprintk(PAGECACHE, "NFS: write_begin(%s/%s(%ld), %u@%lld)\n",
 		file->f_path.dentry->d_parent->d_name.name,
 		file->f_path.dentry->d_name.name,
 		mapping->host->i_ino, len, (long long) pos);
-
+	lseg = pnfs_update_layout(mapping->host,
+				  nfs_file_open_context(file),
+				  pos, len, IOMODE_RW, GFP_NOFS);
 start:
 	/*
 	 * Prevent starvation issues if someone is doing a consistency
@@ -409,6 +412,9 @@ start:
 	if (ret) {
 		unlock_page(page);
 		page_cache_release(page);
+		*pagep = NULL;
+		*fsdata = NULL;
+		goto out;
 	} else if (!once_thru &&
 		   nfs_want_read_modify_write(file, page, pos, len)) {
 		once_thru = 1;
@@ -417,6 +423,12 @@ start:
 		if (!ret)
 			goto start;
 	}
+	ret = pnfs_write_begin(file, page, pos, len, lseg, fsdata);
+ out:
+	if (ret) {
+		put_lseg(lseg);
+		*fsdata = NULL;
+	}
 	return ret;
 }
 
@@ -426,6 +438,7 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
 {
 	unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
 	int status;
+	struct pnfs_layout_segment *lseg;
 
 	dfprintk(PAGECACHE, "NFS: write_end(%s/%s(%ld), %u@%lld)\n",
 		file->f_path.dentry->d_parent->d_name.name,
@@ -452,10 +465,17 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
 			zero_user_segment(page, pglen, PAGE_CACHE_SIZE);
 	}
 
-	status = nfs_updatepage(file, page, offset, copied);
+	lseg = nfs4_pull_lseg_from_fsdata(file, fsdata);
+	status = pnfs_write_end(file, page, pos, len, copied, lseg);
+	if (status)
+		goto out;
+	status = nfs_updatepage(file, page, offset, copied, lseg, fsdata);
 
+out:
 	unlock_page(page);
 	page_cache_release(page);
+	pnfs_write_end_cleanup(file, fsdata);
+	put_lseg(lseg);
 
 	if (status < 0)
 		return status;
@@ -577,7 +597,7 @@ static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 
 	ret = VM_FAULT_LOCKED;
 	if (nfs_flush_incompatible(filp, page) == 0 &&
-	    nfs_updatepage(filp, page, 0, pagelen) == 0)
+	    nfs_updatepage(filp, page, 0, pagelen, NULL, NULL) == 0)
 		goto out;
 
 	ret = VM_FAULT_SIGBUS;
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index f03a5e0..e693718 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1138,6 +1138,41 @@ pnfs_try_to_write_data(struct nfs_write_data *wdata,
 }
 
 /*
+ * This gives the layout driver an opportunity to read in page "around"
+ * the data to be written.  It returns 0 on success, otherwise an error code
+ * which will either be passed up to user, or ignored if
+ * some previous part of write succeeded.
+ * Note the range [pos, pos+len-1] is entirely within the page.
+ */
+int _pnfs_write_begin(struct inode *inode, struct page *page,
+		      loff_t pos, unsigned len,
+		      struct pnfs_layout_segment *lseg,
+		      struct pnfs_fsdata **fsdata)
+{
+	struct pnfs_fsdata *data;
+	int status = 0;
+
+	dprintk("--> %s: pos=%llu len=%u\n",
+		__func__, (unsigned long long)pos, len);
+	data = kzalloc(sizeof(struct pnfs_fsdata), GFP_KERNEL);
+	if (!data) {
+		status = -ENOMEM;
+		goto out;
+	}
+	data->lseg = lseg; /* refcount passed into data to be managed there */
+	status = NFS_SERVER(inode)->pnfs_curr_ld->write_begin(
+						lseg, page, pos, len, data);
+	if (status) {
+		kfree(data);
+		data = NULL;
+	}
+out:
+	*fsdata = data;
+	dprintk("<-- %s: status=%d\n", __func__, status);
+	return status;
+}
+
+/*
  * Called by non rpc-based layout drivers
  */
 int
@@ -1237,6 +1272,12 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
 }
 EXPORT_SYMBOL_GPL(pnfs_set_layoutcommit);
 
+void pnfs_free_fsdata(struct pnfs_fsdata *fsdata)
+{
+	/* lseg refcounting handled directly in nfs_write_end */
+	kfree(fsdata);
+}
+
 /*
  * For the LAYOUT4_NFSV4_1_FILES layout type, NFS_DATA_SYNC WRITEs and
  * NFS_UNSTABLE WRITEs with a COMMIT to data servers must store enough
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index a3fc0f2..525ec55 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -54,6 +54,12 @@ enum pnfs_try_status {
 	PNFS_NOT_ATTEMPTED = 1,
 };
 
+struct pnfs_fsdata {
+	struct pnfs_layout_segment *lseg;
+	int bypass_eof;
+	void *private;
+};
+
 #ifdef CONFIG_NFS_V4_1
 
 #define LAYOUT_NFSV4_1_MODULE_PREFIX "nfs-layouttype4"
@@ -106,6 +112,14 @@ struct pnfs_layoutdriver_type {
 	 */
 	enum pnfs_try_status (*read_pagelist) (struct nfs_read_data *nfs_data);
 	enum pnfs_try_status (*write_pagelist) (struct nfs_write_data *nfs_data, int how);
+	int (*write_begin) (struct pnfs_layout_segment *lseg, struct page *page,
+			    loff_t pos, unsigned count,
+			    struct pnfs_fsdata *fsdata);
+	int (*write_end)(struct inode *inode, struct page *page, loff_t pos,
+			 unsigned count, unsigned copied,
+			 struct pnfs_layout_segment *lseg);
+	void (*write_end_cleanup)(struct file *filp,
+				  struct pnfs_fsdata *fsdata);
 
 	void (*free_deviceid_node) (struct nfs4_deviceid_node *);
 
@@ -175,6 +189,7 @@ enum pnfs_try_status pnfs_try_to_write_data(struct nfs_write_data *,
 enum pnfs_try_status pnfs_try_to_read_data(struct nfs_read_data *,
 					    const struct rpc_call_ops *);
 bool pnfs_generic_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev, struct nfs_page *req);
+void pnfs_free_fsdata(struct pnfs_fsdata *fsdata);
 int pnfs_layout_process(struct nfs4_layoutget *lgp);
 void pnfs_free_lseg_list(struct list_head *tmp_list);
 void pnfs_destroy_layout(struct nfs_inode *);
@@ -186,6 +201,10 @@ void pnfs_set_layout_stateid(struct pnfs_layout_hdr *lo,
 int pnfs_choose_layoutget_stateid(nfs4_stateid *dst,
 				  struct pnfs_layout_hdr *lo,
 				  struct nfs4_state *open_state);
+int _pnfs_write_begin(struct inode *inode, struct page *page,
+		      loff_t pos, unsigned len,
+		      struct pnfs_layout_segment *lseg,
+		      struct pnfs_fsdata **fsdata);
 int mark_matching_lsegs_invalid(struct pnfs_layout_hdr *lo,
 				struct list_head *tmp_list,
 				struct pnfs_layout_range *recall_range);
@@ -287,6 +306,13 @@ static inline void pnfs_clear_request_commit(struct nfs_page *req)
 		put_lseg(req->wb_commit_lseg);
 }
 
+static inline int pnfs_grow_ok(struct pnfs_layout_segment *lseg,
+			       struct pnfs_fsdata *fsdata)
+{
+	return !fsdata  || ((struct pnfs_layout_segment *)fsdata == lseg) ||
+		!fsdata->bypass_eof;
+}
+
 /* Should the pNFS client commit and return the layout upon a setattr */
 static inline bool
 pnfs_ld_layoutret_on_setattr(struct inode *inode)
@@ -297,6 +323,49 @@ pnfs_ld_layoutret_on_setattr(struct inode *inode)
 		PNFS_LAYOUTRET_ON_SETATTR;
 }
 
+static inline int pnfs_write_begin(struct file *filp, struct page *page,
+				   loff_t pos, unsigned len,
+				   struct pnfs_layout_segment *lseg,
+				   void **fsdata)
+{
+	struct inode *inode = filp->f_dentry->d_inode;
+	struct nfs_server *nfss = NFS_SERVER(inode);
+	int status = 0;
+
+	*fsdata = lseg;
+	if (lseg && nfss->pnfs_curr_ld->write_begin)
+		status = _pnfs_write_begin(inode, page, pos, len, lseg,
+					   (struct pnfs_fsdata **) fsdata);
+	return status;
+}
+
+/* CAREFUL - what happens if copied < len??? */
+static inline int pnfs_write_end(struct file *filp, struct page *page,
+				 loff_t pos, unsigned len, unsigned copied,
+				 struct pnfs_layout_segment *lseg)
+{
+	struct inode *inode = filp->f_dentry->d_inode;
+	struct nfs_server *nfss = NFS_SERVER(inode);
+
+	if (nfss->pnfs_curr_ld && nfss->pnfs_curr_ld->write_end)
+		return nfss->pnfs_curr_ld->write_end(inode, page, pos, len,
+						     copied, lseg);
+	else
+		return 0;
+}
+
+static inline void pnfs_write_end_cleanup(struct file *filp, void *fsdata)
+{
+	struct nfs_server *nfss = NFS_SERVER(filp->f_dentry->d_inode);
+
+	if (fsdata && nfss->pnfs_curr_ld) {
+		if (nfss->pnfs_curr_ld->write_end_cleanup)
+			nfss->pnfs_curr_ld->write_end_cleanup(filp, fsdata);
+		if (nfss->pnfs_curr_ld->write_begin)
+			pnfs_free_fsdata(fsdata);
+	}
+}
+
 static inline int pnfs_return_layout(struct inode *ino)
 {
 	struct nfs_inode *nfsi = NFS_I(ino);
@@ -317,6 +386,19 @@ static inline void pnfs_pageio_init(struct nfs_pageio_descriptor *pgio,
 		pgio->pg_test = ld->pg_test;
 }
 
+static inline struct pnfs_layout_segment *
+nfs4_pull_lseg_from_fsdata(struct file *filp, void *fsdata)
+{
+	if (fsdata) {
+		struct nfs_server *nfss = NFS_SERVER(filp->f_dentry->d_inode);
+
+		if (nfss->pnfs_curr_ld && nfss->pnfs_curr_ld->write_begin)
+			return ((struct pnfs_fsdata *) fsdata)->lseg;
+		return (struct pnfs_layout_segment *)fsdata;
+	}
+	return NULL;
+}
+
 #else  /* CONFIG_NFS_V4_1 */
 
 static inline void pnfs_destroy_all_layouts(struct nfs_client *clp)
@@ -345,6 +427,12 @@ pnfs_update_layout(struct inode *ino, struct nfs_open_context *ctx,
 	return NULL;
 }
 
+static inline int pnfs_grow_ok(struct pnfs_layout_segment *lseg,
+			       struct pnfs_fsdata *fsdata)
+{
+	return 1;
+}
+
 static inline enum pnfs_try_status
 pnfs_try_to_read_data(struct nfs_read_data *data,
 		      const struct rpc_call_ops *call_ops)
@@ -364,6 +452,26 @@ static inline int pnfs_return_layout(struct inode *ino)
 	return 0;
 }
 
+static inline int pnfs_write_begin(struct file *filp, struct page *page,
+				   loff_t pos, unsigned len,
+				   struct pnfs_layout_segment *lseg,
+				   void **fsdata)
+{
+	*fsdata = NULL;
+	return 0;
+}
+
+static inline int pnfs_write_end(struct file *filp, struct page *page,
+				 loff_t pos, unsigned len, unsigned copied,
+				 struct pnfs_layout_segment *lseg)
+{
+	return 0;
+}
+
+static inline void pnfs_write_end_cleanup(struct file *filp, void *fsdata)
+{
+}
+
 static inline bool
 pnfs_ld_layoutret_on_setattr(struct inode *inode)
 {
@@ -435,6 +543,13 @@ static inline int pnfs_layoutcommit_inode(struct inode *inode, bool sync)
 static inline void nfs4_deviceid_purge_client(struct nfs_client *ncl)
 {
 }
+
+static inline struct pnfs_layout_segment *
+nfs4_pull_lseg_from_fsdata(struct file *filp, void *fsdata)
+{
+	return NULL;
+}
+
 #endif /* CONFIG_NFS_V4_1 */
 
 #endif /* FS_NFS_PNFS_H */
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index e268e3b..75e2a6b 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -673,7 +673,9 @@ out:
 }
 
 static int nfs_writepage_setup(struct nfs_open_context *ctx, struct page *page,
-		unsigned int offset, unsigned int count)
+		unsigned int offset, unsigned int count,
+		struct pnfs_layout_segment *lseg, void *fsdata)
+
 {
 	struct nfs_page	*req;
 
@@ -681,7 +683,8 @@ static int nfs_writepage_setup(struct nfs_open_context *ctx, struct page *page,
 	if (IS_ERR(req))
 		return PTR_ERR(req);
 	/* Update file length */
-	nfs_grow_file(page, offset, count);
+	if (pnfs_grow_ok(lseg, fsdata))
+		nfs_grow_file(page, offset, count);
 	nfs_mark_uptodate(page, req->wb_pgbase, req->wb_bytes);
 	nfs_mark_request_dirty(req);
 	nfs_clear_page_tag_locked(req);
@@ -734,7 +737,8 @@ static int nfs_write_pageuptodate(struct page *page, struct inode *inode)
  * things with a page scheduled for an RPC call (e.g. invalidate it).
  */
 int nfs_updatepage(struct file *file, struct page *page,
-		unsigned int offset, unsigned int count)
+		unsigned int offset, unsigned int count,
+		struct pnfs_layout_segment *lseg, void *fsdata)
 {
 	struct nfs_open_context *ctx = nfs_file_open_context(file);
 	struct inode	*inode = page->mapping->host;
@@ -759,7 +763,7 @@ int nfs_updatepage(struct file *file, struct page *page,
 		offset = 0;
 	}
 
-	status = nfs_writepage_setup(ctx, page, offset, count);
+	status = nfs_writepage_setup(ctx, page, offset, count, lseg, fsdata);
 	if (status < 0)
 		nfs_set_pageerror(page);
 
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 1b93b9c..be1ac1d 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -510,7 +510,8 @@ extern int  nfs_congestion_kb;
 extern int  nfs_writepage(struct page *page, struct writeback_control *wbc);
 extern int  nfs_writepages(struct address_space *, struct writeback_control *);
 extern int  nfs_flush_incompatible(struct file *file, struct page *page);
-extern int  nfs_updatepage(struct file *, struct page *, unsigned int, unsigned int);
+extern int  nfs_updatepage(struct file *, struct page *, unsigned int, unsigned int,
+			struct pnfs_layout_segment *, void *);
 extern void nfs_writeback_done(struct rpc_task *, struct nfs_write_data *);
 
 /*
-- 
1.7.4.1
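
The fsdata check in pnfs_grow_ok() above can be modeled in userspace. This is a minimal sketch, not the kernel code: lseg_model, fsdata_model and grow_ok_model are hypothetical stand-ins that carry only the fields the check reads. It assumes, as the patch does, that fsdata is either NULL, the lseg pointer itself (driver without write_begin), or a driver-allocated fsdata structure.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures (illustration only). */
struct lseg_model {
	int id;
};

struct fsdata_model {
	struct lseg_model *lseg;
	bool bypass_eof;
};

/*
 * Mirrors the three cases of pnfs_grow_ok():
 *  - no fsdata at all: growing the file is fine;
 *  - fsdata is just the lseg pointer: no driver state, fine;
 *  - otherwise the driver's bypass_eof flag decides.
 */
static bool grow_ok_model(const struct lseg_model *lseg, const void *fsdata)
{
	if (fsdata == NULL)
		return true;
	if (fsdata == (const void *)lseg)
		return true;
	return !((const struct fsdata_model *)fsdata)->bypass_eof;
}
```

The pointer comparison only works because the caller stores the lseg itself in fsdata when the driver has no write_begin hook, so the check never dereferences an lseg as if it were fsdata.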



* [PATCH 05/34] pnfs: ask for layout_blksize and save it in nfs_server
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (3 preceding siblings ...)
  2011-06-12 23:43 ` [PATCH 04/34] pnfs: hook nfs_write_begin/end to allow layout driver manipulation Jim Rees
@ 2011-06-12 23:43 ` Jim Rees
  2011-06-14 15:01   ` Benny Halevy
  2011-06-12 23:44 ` [PATCH 06/34] pnfs: cleanup_layoutcommit Jim Rees
                   ` (28 subsequent siblings)
  33 siblings, 1 reply; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:43 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Peng Tao <bergwolf@gmail.com>

The block layout driver needs it to determine the I/O size.

Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Tao Guo <glorioustao@gmail.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Peng Tao <bergwolf@gmail.com>
---
 fs/nfs/client.c           |    1 +
 fs/nfs/nfs4_fs.h          |    2 +-
 fs/nfs/nfs4proc.c         |    5 +-
 fs/nfs/nfs4xdr.c          |  101 +++++++++++++++++++++++++++++++++++++--------
 include/linux/nfs_fs_sb.h |    4 +-
 include/linux/nfs_xdr.h   |    3 +-
 6 files changed, 93 insertions(+), 23 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 6bdb7da0..b2c6920 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -937,6 +937,7 @@ static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fh *mntf
 	if (server->wsize > NFS_MAX_FILE_IO_SIZE)
 		server->wsize = NFS_MAX_FILE_IO_SIZE;
 	server->wpages = (server->wsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	server->pnfs_blksize = fsinfo->blksize;
 	set_pnfs_layoutdriver(server, mntfh, fsinfo->layouttype);
 
 	server->wtmult = nfs_block_bits(fsinfo->wtmult, NULL);
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index c4a6983..5725a7e 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -315,7 +315,7 @@ extern const struct nfs4_minor_version_ops *nfs_v4_minor_ops[];
 extern const u32 nfs4_fattr_bitmap[2];
 extern const u32 nfs4_statfs_bitmap[2];
 extern const u32 nfs4_pathconf_bitmap[2];
-extern const u32 nfs4_fsinfo_bitmap[2];
+extern const u32 nfs4_fsinfo_bitmap[3];
 extern const u32 nfs4_fs_locations_bitmap[2];
 
 /* nfs4renewd.c */
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 4a5ad93..5246db8 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -137,12 +137,13 @@ const u32 nfs4_pathconf_bitmap[2] = {
 	0
 };
 
-const u32 nfs4_fsinfo_bitmap[2] = { FATTR4_WORD0_MAXFILESIZE
+const u32 nfs4_fsinfo_bitmap[3] = { FATTR4_WORD0_MAXFILESIZE
 			| FATTR4_WORD0_MAXREAD
 			| FATTR4_WORD0_MAXWRITE
 			| FATTR4_WORD0_LEASE_TIME,
 			FATTR4_WORD1_TIME_DELTA
-			| FATTR4_WORD1_FS_LAYOUT_TYPES
+			| FATTR4_WORD1_FS_LAYOUT_TYPES,
+			FATTR4_WORD2_LAYOUT_BLKSIZE
 };
 
 const u32 nfs4_fs_locations_bitmap[2] = {
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index 3620c45..fdcbd8f 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -91,7 +91,7 @@ static int nfs4_stat_to_errno(int);
 #define encode_getfh_maxsz      (op_encode_hdr_maxsz)
 #define decode_getfh_maxsz      (op_decode_hdr_maxsz + 1 + \
 				((3+NFS4_FHSIZE) >> 2))
-#define nfs4_fattr_bitmap_maxsz 3
+#define nfs4_fattr_bitmap_maxsz 4
 #define encode_getattr_maxsz    (op_encode_hdr_maxsz + nfs4_fattr_bitmap_maxsz)
 #define nfs4_name_maxsz		(1 + ((3 + NFS4_MAXNAMLEN) >> 2))
 #define nfs4_path_maxsz		(1 + ((3 + NFS4_MAXPATHLEN) >> 2))
@@ -113,7 +113,11 @@ static int nfs4_stat_to_errno(int);
 #define encode_restorefh_maxsz  (op_encode_hdr_maxsz)
 #define decode_restorefh_maxsz  (op_decode_hdr_maxsz)
 #define encode_fsinfo_maxsz	(encode_getattr_maxsz)
-#define decode_fsinfo_maxsz	(op_decode_hdr_maxsz + 15)
+/* The 5 accounts for the PNFS attributes, and assumes that at most three
+ * layout types will be returned.
+ */
+#define decode_fsinfo_maxsz	(op_decode_hdr_maxsz + \
+				 nfs4_fattr_bitmap_maxsz + 4 + 8 + 5)
 #define encode_renew_maxsz	(op_encode_hdr_maxsz + 3)
 #define decode_renew_maxsz	(op_decode_hdr_maxsz)
 #define encode_setclientid_maxsz \
@@ -1095,6 +1099,35 @@ static void encode_getattr_two(struct xdr_stream *xdr, uint32_t bm0, uint32_t bm
 	hdr->replen += decode_getattr_maxsz;
 }
 
+static void
+encode_getattr_three(struct xdr_stream *xdr,
+		     uint32_t bm0, uint32_t bm1, uint32_t bm2,
+		     struct compound_hdr *hdr)
+{
+	__be32 *p;
+
+	p = reserve_space(xdr, 4);
+	*p = cpu_to_be32(OP_GETATTR);
+	if (bm2) {
+		p = reserve_space(xdr, 16);
+		*p++ = cpu_to_be32(3);
+		*p++ = cpu_to_be32(bm0);
+		*p++ = cpu_to_be32(bm1);
+		*p = cpu_to_be32(bm2);
+	} else if (bm1) {
+		p = reserve_space(xdr, 12);
+		*p++ = cpu_to_be32(2);
+		*p++ = cpu_to_be32(bm0);
+		*p = cpu_to_be32(bm1);
+	} else {
+		p = reserve_space(xdr, 8);
+		*p++ = cpu_to_be32(1);
+		*p = cpu_to_be32(bm0);
+	}
+	hdr->nops++;
+	hdr->replen += decode_getattr_maxsz;
+}
+
 static void encode_getfattr(struct xdr_stream *xdr, const u32* bitmask, struct compound_hdr *hdr)
 {
 	encode_getattr_two(xdr, bitmask[0] & nfs4_fattr_bitmap[0],
@@ -1103,8 +1136,11 @@ static void encode_getfattr(struct xdr_stream *xdr, const u32* bitmask, struct c
 
 static void encode_fsinfo(struct xdr_stream *xdr, const u32* bitmask, struct compound_hdr *hdr)
 {
-	encode_getattr_two(xdr, bitmask[0] & nfs4_fsinfo_bitmap[0],
-			   bitmask[1] & nfs4_fsinfo_bitmap[1], hdr);
+	encode_getattr_three(xdr,
+			     bitmask[0] & nfs4_fsinfo_bitmap[0],
+			     bitmask[1] & nfs4_fsinfo_bitmap[1],
+			     bitmask[2] & nfs4_fsinfo_bitmap[2],
+			     hdr);
 }
 
 static void encode_fs_locations(struct xdr_stream *xdr, const u32* bitmask, struct compound_hdr *hdr)
@@ -2575,7 +2611,7 @@ static void nfs4_xdr_enc_setclientid_confirm(struct rpc_rqst *req,
 	struct compound_hdr hdr = {
 		.nops	= 0,
 	};
-	const u32 lease_bitmap[2] = { FATTR4_WORD0_LEASE_TIME, 0 };
+	const u32 lease_bitmap[3] = { FATTR4_WORD0_LEASE_TIME, 0, 0 };
 
 	encode_compound_hdr(xdr, req, &hdr);
 	encode_setclientid_confirm(xdr, arg, &hdr);
@@ -2719,7 +2755,7 @@ static void nfs4_xdr_enc_get_lease_time(struct rpc_rqst *req,
 	struct compound_hdr hdr = {
 		.minorversion = nfs4_xdr_minorversion(&args->la_seq_args),
 	};
-	const u32 lease_bitmap[2] = { FATTR4_WORD0_LEASE_TIME, 0 };
+	const u32 lease_bitmap[3] = { FATTR4_WORD0_LEASE_TIME, 0, 0 };
 
 	encode_compound_hdr(xdr, req, &hdr);
 	encode_sequence(xdr, &args->la_seq_args, &hdr);
@@ -2947,14 +2983,17 @@ static int decode_attr_bitmap(struct xdr_stream *xdr, uint32_t *bitmap)
 		goto out_overflow;
 	bmlen = be32_to_cpup(p);
 
-	bitmap[0] = bitmap[1] = 0;
+	bitmap[0] = bitmap[1] = bitmap[2] = 0;
 	p = xdr_inline_decode(xdr, (bmlen << 2));
 	if (unlikely(!p))
 		goto out_overflow;
 	if (bmlen > 0) {
 		bitmap[0] = be32_to_cpup(p++);
-		if (bmlen > 1)
-			bitmap[1] = be32_to_cpup(p);
+		if (bmlen > 1) {
+			bitmap[1] = be32_to_cpup(p++);
+			if (bmlen > 2)
+				bitmap[2] = be32_to_cpup(p);
+		}
 	}
 	return 0;
 out_overflow:
@@ -2986,8 +3025,9 @@ static int decode_attr_supported(struct xdr_stream *xdr, uint32_t *bitmap, uint3
 			return ret;
 		bitmap[0] &= ~FATTR4_WORD0_SUPPORTED_ATTRS;
 	} else
-		bitmask[0] = bitmask[1] = 0;
-	dprintk("%s: bitmask=%08x:%08x\n", __func__, bitmask[0], bitmask[1]);
+		bitmask[0] = bitmask[1] = bitmask[2] = 0;
+	dprintk("%s: bitmask=%08x:%08x:%08x\n", __func__,
+		bitmask[0], bitmask[1], bitmask[2]);
 	return 0;
 }
 
@@ -4041,7 +4081,7 @@ out_overflow:
 static int decode_server_caps(struct xdr_stream *xdr, struct nfs4_server_caps_res *res)
 {
 	__be32 *savep;
-	uint32_t attrlen, bitmap[2] = {0};
+	uint32_t attrlen, bitmap[3] = {0};
 	int status;
 
 	if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
@@ -4067,7 +4107,7 @@ xdr_error:
 static int decode_statfs(struct xdr_stream *xdr, struct nfs_fsstat *fsstat)
 {
 	__be32 *savep;
-	uint32_t attrlen, bitmap[2] = {0};
+	uint32_t attrlen, bitmap[3] = {0};
 	int status;
 
 	if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
@@ -4099,7 +4139,7 @@ xdr_error:
 static int decode_pathconf(struct xdr_stream *xdr, struct nfs_pathconf *pathconf)
 {
 	__be32 *savep;
-	uint32_t attrlen, bitmap[2] = {0};
+	uint32_t attrlen, bitmap[3] = {0};
 	int status;
 
 	if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
@@ -4239,7 +4279,7 @@ static int decode_getfattr_generic(struct xdr_stream *xdr, struct nfs_fattr *fat
 {
 	__be32 *savep;
 	uint32_t attrlen,
-		 bitmap[2] = {0};
+		 bitmap[3] = {0};
 	int status;
 
 	status = decode_op_hdr(xdr, OP_GETATTR);
@@ -4325,10 +4365,32 @@ static int decode_attr_pnfstype(struct xdr_stream *xdr, uint32_t *bitmap,
 	return status;
 }
 
+/*
+ * The preferred block size for layout-directed I/O
+ */
+static int decode_attr_layout_blksize(struct xdr_stream *xdr, uint32_t *bitmap,
+				      uint32_t *res)
+{
+	__be32 *p;
+
+	dprintk("%s: bitmap is %x\n", __func__, bitmap[2]);
+	*res = 0;
+	if (bitmap[2] & FATTR4_WORD2_LAYOUT_BLKSIZE) {
+		p = xdr_inline_decode(xdr, 4);
+		if (unlikely(!p)) {
+			print_overflow_msg(__func__, xdr);
+			return -EIO;
+		}
+		*res = be32_to_cpup(p);
+		bitmap[2] &= ~FATTR4_WORD2_LAYOUT_BLKSIZE;
+	}
+	return 0;
+}
+
 static int decode_fsinfo(struct xdr_stream *xdr, struct nfs_fsinfo *fsinfo)
 {
 	__be32 *savep;
-	uint32_t attrlen, bitmap[2];
+	uint32_t attrlen, bitmap[3];
 	int status;
 
 	if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
@@ -4356,6 +4418,9 @@ static int decode_fsinfo(struct xdr_stream *xdr, struct nfs_fsinfo *fsinfo)
 	status = decode_attr_pnfstype(xdr, bitmap, &fsinfo->layouttype);
 	if (status != 0)
 		goto xdr_error;
+	status = decode_attr_layout_blksize(xdr, bitmap, &fsinfo->blksize);
+	if (status)
+		goto xdr_error;
 
 	status = verify_attr_len(xdr, savep, attrlen);
 xdr_error:
@@ -4775,7 +4840,7 @@ static int decode_getacl(struct xdr_stream *xdr, struct rpc_rqst *req,
 {
 	__be32 *savep;
 	uint32_t attrlen,
-		 bitmap[2] = {0};
+		 bitmap[3] = {0};
 	struct kvec *iov = req->rq_rcv_buf.head;
 	int status;
 
@@ -6605,7 +6670,7 @@ out:
 int nfs4_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry,
 		       int plus)
 {
-	uint32_t bitmap[2] = {0};
+	uint32_t bitmap[3] = {0};
 	uint32_t len;
 	__be32 *p = xdr_inline_decode(xdr, 4);
 	if (unlikely(!p))
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 87694ca..79cc4ca 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -130,7 +130,7 @@ struct nfs_server {
 #endif
 
 #ifdef CONFIG_NFS_V4
-	u32			attr_bitmask[2];/* V4 bitmask representing the set
+	u32			attr_bitmask[3];/* V4 bitmask representing the set
 						   of attributes supported on this
 						   filesystem */
 	u32			cache_consistency_bitmask[2];
@@ -143,6 +143,8 @@ struct nfs_server {
 						   filesystem */
 	struct pnfs_layoutdriver_type  *pnfs_curr_ld; /* Active layout driver */
 	struct rpc_wait_queue	roc_rpcwaitq;
+	void			*pnfs_ld_data; /* per mount point data */
+	u32			pnfs_blksize; /* layout_blksize attr */
 
 	/* the following fields are protected by nfs_client->cl_lock */
 	struct rb_root		state_owners;
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 00442f5..a9c43ba 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -122,6 +122,7 @@ struct nfs_fsinfo {
 	struct timespec		time_delta; /* server time granularity */
 	__u32			lease_time; /* in seconds */
 	__u32			layouttype; /* supported pnfs layout driver */
+	__u32			blksize; /* preferred pnfs io block size */
 };
 
 struct nfs_fsstat {
@@ -954,7 +955,7 @@ struct nfs4_server_caps_arg {
 };
 
 struct nfs4_server_caps_res {
-	u32				attr_bitmask[2];
+	u32				attr_bitmask[3];
 	u32				acl_bitmask;
 	u32				has_links;
 	u32				has_symlinks;
-- 
1.7.4.1
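
The word-count choice in encode_getattr_three() and the matching change to decode_attr_bitmap() can be sketched in userspace as follows. This is an illustrative model under stated assumptions: encode_bitmap_model and decode_bitmap_model are hypothetical names, and words are kept in host byte order, whereas the kernel code converts each word with cpu_to_be32() and be32_to_cpup().

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Write the bitmap length followed by only as many words as are needed
 * to reach the highest non-zero word. Returns the number of 32-bit
 * words stored in out (at most 4: one length word plus three bitmap
 * words, matching nfs4_fattr_bitmap_maxsz being raised to 4).
 */
static size_t encode_bitmap_model(uint32_t bm0, uint32_t bm1, uint32_t bm2,
				  uint32_t out[4])
{
	uint32_t len = bm2 ? 3 : (bm1 ? 2 : 1);
	size_t n = 0;

	out[n++] = len;
	out[n++] = bm0;
	if (len > 1)
		out[n++] = bm1;
	if (len > 2)
		out[n++] = bm2;
	return n;
}

/* Words the sender did not transmit are read back as zero. */
static void decode_bitmap_model(const uint32_t *in, uint32_t bitmap[3])
{
	uint32_t bmlen = in[0];

	bitmap[0] = bitmap[1] = bitmap[2] = 0;
	if (bmlen > 0)
		bitmap[0] = in[1];
	if (bmlen > 1)
		bitmap[1] = in[2];
	if (bmlen > 2)
		bitmap[2] = in[3];
}
```

Because a request with no word-2 attributes still goes out as a one- or two-word bitmap, only requests that actually ask for a word-2 attribute such as FATTR4_WORD2_LAYOUT_BLKSIZE put a third word on the wire.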



* [PATCH 06/34] pnfs: cleanup_layoutcommit
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (4 preceding siblings ...)
  2011-06-12 23:43 ` [PATCH 05/34] pnfs: ask for layout_blksize and save it in nfs_server Jim Rees
@ 2011-06-12 23:44 ` Jim Rees
  2011-06-13 21:19   ` Benny Halevy
                     ` (2 more replies)
  2011-06-12 23:44 ` [PATCH 07/34] pnfsblock: define PNFS_BLOCK Kconfig option Jim Rees
                   ` (27 subsequent siblings)
  33 siblings, 3 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Peng Tao <bergwolf@gmail.com>

This gives the layout driver a chance to clean up the structures it put in.
Also ensure that layoutcommit does not commit more than isize, as the block
layout driver may dirty pages beyond EOF.

Signed-off-by: Andy Adamson <andros@netapp.com>
[fixup layout header pointer for layoutcommit]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Peng Tao <bergwolf@gmail.com>
---
 fs/nfs/nfs4proc.c       |    1 +
 fs/nfs/nfs4xdr.c        |    3 ++-
 fs/nfs/pnfs.c           |   15 +++++++++++++++
 fs/nfs/pnfs.h           |    4 ++++
 include/linux/nfs_xdr.h |    1 +
 5 files changed, 23 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 5246db8..e27a648 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -5890,6 +5890,7 @@ static void nfs4_layoutcommit_release(void *calldata)
 {
 	struct nfs4_layoutcommit_data *data = calldata;
 
+	pnfs_cleanup_layoutcommit(data->args.inode, data);
 	/* Matched by references in pnfs_set_layoutcommit */
 	put_lseg(data->lseg);
 	put_rpccred(data->cred);
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index fdcbd8f..57295d1 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -1963,7 +1963,7 @@ encode_layoutcommit(struct xdr_stream *xdr,
 	*p++ = cpu_to_be32(OP_LAYOUTCOMMIT);
 	/* Only whole file layouts */
 	p = xdr_encode_hyper(p, 0); /* offset */
-	p = xdr_encode_hyper(p, NFS4_MAX_UINT64); /* length */
+	p = xdr_encode_hyper(p, args->lastbytewritten + 1); /* length */
 	*p++ = cpu_to_be32(0); /* reclaim */
 	p = xdr_encode_opaque_fixed(p, args->stateid.data, NFS4_STATEID_SIZE);
 	*p++ = cpu_to_be32(1); /* newoffset = TRUE */
@@ -5467,6 +5467,7 @@ static int decode_layoutcommit(struct xdr_stream *xdr,
 	int status;
 
 	status = decode_op_hdr(xdr, OP_LAYOUTCOMMIT);
+	res->status = status;
 	if (status)
 		return status;
 
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index e693718..48a06a1 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1248,6 +1248,7 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
 {
 	struct nfs_inode *nfsi = NFS_I(wdata->inode);
 	loff_t end_pos = wdata->mds_offset + wdata->res.count;
+	loff_t isize = i_size_read(wdata->inode);
 	bool mark_as_dirty = false;
 
 	spin_lock(&nfsi->vfs_inode.i_lock);
@@ -1261,9 +1262,13 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
 		dprintk("%s: Set layoutcommit for inode %lu ",
 			__func__, wdata->inode->i_ino);
 	}
+	if (end_pos > isize)
+		end_pos = isize;
 	if (end_pos > wdata->lseg->pls_end_pos)
 		wdata->lseg->pls_end_pos = end_pos;
 	spin_unlock(&nfsi->vfs_inode.i_lock);
+	dprintk("%s: lseg %p end_pos %llu\n",
+		__func__, wdata->lseg, wdata->lseg->pls_end_pos);
 
 	/* if pnfs_layoutcommit_inode() runs between inode locks, the next one
 	 * will be a noop because NFS_INO_LAYOUTCOMMIT will not be set */
@@ -1272,6 +1277,16 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
 }
 EXPORT_SYMBOL_GPL(pnfs_set_layoutcommit);
 
+void pnfs_cleanup_layoutcommit(struct inode *inode,
+			       struct nfs4_layoutcommit_data *data)
+{
+	struct nfs_server *nfss = NFS_SERVER(inode);
+
+	if (nfss->pnfs_curr_ld->cleanup_layoutcommit)
+		nfss->pnfs_curr_ld->cleanup_layoutcommit(
+					NFS_I(inode)->layout, data);
+}
+
 void pnfs_free_fsdata(struct pnfs_fsdata *fsdata)
 {
 	/* lseg refcounting handled directly in nfs_write_end */
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 525ec55..5048898 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -127,6 +127,9 @@ struct pnfs_layoutdriver_type {
 				     struct xdr_stream *xdr,
 				     const struct nfs4_layoutreturn_args *args);
 
+	void (*cleanup_layoutcommit) (struct pnfs_layout_hdr *layoutid,
+				      struct nfs4_layoutcommit_data *data);
+
 	void (*encode_layoutcommit) (struct pnfs_layout_hdr *layoutid,
 				     struct xdr_stream *xdr,
 				     const struct nfs4_layoutcommit_args *args);
@@ -213,6 +216,7 @@ void pnfs_roc_release(struct inode *ino);
 void pnfs_roc_set_barrier(struct inode *ino, u32 barrier);
 bool pnfs_roc_drain(struct inode *ino, u32 *barrier);
 void pnfs_set_layoutcommit(struct nfs_write_data *wdata);
+void pnfs_cleanup_layoutcommit(struct inode *inode, struct nfs4_layoutcommit_data *data);
 int pnfs_layoutcommit_inode(struct inode *inode, bool sync);
 int _pnfs_return_layout(struct inode *);
 int pnfs_ld_write_done(struct nfs_write_data *);
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index a9c43ba..2c3ffda 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -270,6 +270,7 @@ struct nfs4_layoutcommit_res {
 	struct nfs_fattr *fattr;
 	const struct nfs_server *server;
 	struct nfs4_sequence_res seq_res;
+	int status;
 };
 
 struct nfs4_layoutcommit_data {
-- 
1.7.4.1
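
The end_pos clamp added to pnfs_set_layoutcommit() above can be sketched in userspace. This is a minimal model, assuming plain 64-bit offsets in place of loff_t and leaving out the i_lock; next_pls_end_pos is a hypothetical helper name, not a kernel function.

```c
#include <stdint.h>

/*
 * Given the segment's recorded end position, the end of the write that
 * just completed, and the current inode size, return the new recorded
 * end position. The write end is clamped to isize first, so pages
 * dirtied beyond EOF never extend the range to be committed.
 */
static uint64_t next_pls_end_pos(uint64_t pls_end_pos, uint64_t end_pos,
				 uint64_t isize)
{
	if (end_pos > isize)
		end_pos = isize;
	if (end_pos > pls_end_pos)
		return end_pos;
	return pls_end_pos;
}
```

For example, a write ending at byte 8192 on a 5000-byte file records 5000, while a position already recorded in the segment is never moved backwards.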



* [PATCH 07/34] pnfsblock: define PNFS_BLOCK Kconfig option
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (5 preceding siblings ...)
  2011-06-12 23:44 ` [PATCH 06/34] pnfs: cleanup_layoutcommit Jim Rees
@ 2011-06-12 23:44 ` Jim Rees
  2011-06-14 15:13   ` Benny Halevy
  2011-06-12 23:44 ` [PATCH 08/34] pnfsblock: blocklayout stub Jim Rees
                   ` (26 subsequent siblings)
  33 siblings, 1 reply; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Define a configuration variable to enable/disable compilation of the
block driver code.

Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[pnfs-block: fix CONFIG_PNFS_BLOCK dependencies]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
 fs/nfs/Kconfig              |    8 ++++++++
 fs/nfs/Makefile             |    1 +
 fs/nfs/blocklayout/Makefile |    5 +++++
 3 files changed, 14 insertions(+), 0 deletions(-)
 create mode 100644 fs/nfs/blocklayout/Makefile

diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig
index 8151554..3cebf1b 100644
--- a/fs/nfs/Kconfig
+++ b/fs/nfs/Kconfig
@@ -97,6 +97,14 @@ config PNFS_OBJLAYOUT
 
 	  If unsure, say N.
 
+config PNFS_BLOCK
+	tristate "Provide a pNFS block client (EXPERIMENTAL)"
+	depends on NFS_FS && PNFS
+	help
+	  Say M or Y here if you want your pNFS client to support the block protocol.
+
+	  If unsure, say N.
+
 config ROOT_NFS
 	bool "Root file system on NFS"
 	depends on NFS_FS=y && IP_PNP
diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index 6a34f7d..b58613d 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -23,3 +23,4 @@ obj-$(CONFIG_PNFS_FILE_LAYOUT) += nfs_layout_nfsv41_files.o
 nfs_layout_nfsv41_files-y := nfs4filelayout.o nfs4filelayoutdev.o
 
 obj-$(CONFIG_PNFS_OBJLAYOUT) += objlayout/
+obj-$(CONFIG_PNFS_BLOCK) += blocklayout/
diff --git a/fs/nfs/blocklayout/Makefile b/fs/nfs/blocklayout/Makefile
new file mode 100644
index 0000000..f214c1c
--- /dev/null
+++ b/fs/nfs/blocklayout/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for the pNFS block layout driver kernel module
+#
+obj-$(CONFIG_PNFS_BLOCK) +=
+blocklayoutdriver-objs :=
-- 
1.7.4.1



* [PATCH 08/34] pnfsblock: blocklayout stub
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (6 preceding siblings ...)
  2011-06-12 23:44 ` [PATCH 07/34] pnfsblock: define PNFS_BLOCK Kconfig option Jim Rees
@ 2011-06-12 23:44 ` Jim Rees
  2011-06-12 23:44 ` [PATCH 09/34] pnfsblock: layout alloc and free Jim Rees
                   ` (25 subsequent siblings)
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Add the minimal structure for a pNFS block layout driver,
with all function pointers aimed at stubs.

[pnfsblock: SQUASHME: port block layout code]
Signed-off-by: Peng Tao <peng_tao@emc.com>
[pnfsblock: SQUASHME: adjust to API change]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfs: move pnfs_layout_type inline in nfs_inode]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[blocklayout: encode_layoutcommit implementation]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
 fs/nfs/blocklayout/Makefile      |    4 +-
 fs/nfs/blocklayout/blocklayout.c |  166 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 168 insertions(+), 2 deletions(-)
 create mode 100644 fs/nfs/blocklayout/blocklayout.c

diff --git a/fs/nfs/blocklayout/Makefile b/fs/nfs/blocklayout/Makefile
index f214c1c..6bf49cd 100644
--- a/fs/nfs/blocklayout/Makefile
+++ b/fs/nfs/blocklayout/Makefile
@@ -1,5 +1,5 @@
 #
 # Makefile for the pNFS block layout driver kernel module
 #
-obj-$(CONFIG_PNFS_BLOCK) +=
-blocklayoutdriver-objs :=
+obj-$(CONFIG_PNFS_BLOCK) += blocklayoutdriver.o
+blocklayoutdriver-objs := blocklayout.o
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
new file mode 100644
index 0000000..2e0d41a
--- /dev/null
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -0,0 +1,166 @@
+/*
+ *  linux/fs/nfs/blocklayout/blocklayout.c
+ *
+ *  Module for the NFSv4.1 pNFS block layout driver.
+ *
+ *  Copyright (c) 2006 The Regents of the University of Michigan.
+ *  All rights reserved.
+ *
+ *  Andy Adamson <andros@citi.umich.edu>
+ *  Fred Isaman <iisaman@umich.edu>
+ *
+ * permission is granted to use, copy, create derivative works and
+ * redistribute this software and such derivative works for any purpose,
+ * so long as the name of the university of michigan is not used in
+ * any advertising or publicity pertaining to the use or distribution
+ * of this software without specific, written prior authorization.  if
+ * the above copyright notice or any other identification of the
+ * university of michigan is included in any copy of any portion of
+ * this software, then the disclaimer below must also be included.
+ *
+ * this software is provided as is, without representation from the
+ * university of michigan as to its fitness for any purpose, and without
+ * warranty by the university of michigan of any kind, either express
+ * or implied, including without limitation the implied warranties of
+ * merchantability and fitness for a particular purpose.  the regents
+ * of the university of michigan shall not be liable for any damages,
+ * including special, indirect, incidental, or consequential damages,
+ * with respect to any claim arising out or in connection with the use
+ * of the software, even if it has been or is hereafter advised of the
+ * possibility of such damages.
+ */
+#include <linux/module.h>
+#include <linux/init.h>
+
+#include "../pnfs.h"
+
+#define NFSDBG_FACILITY         NFSDBG_PNFS_LD
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Andy Adamson <andros@citi.umich.edu>");
+MODULE_DESCRIPTION("The NFSv4.1 pNFS Block layout driver");
+
+static enum pnfs_try_status
+bl_read_pagelist(struct nfs_read_data *rdata)
+{
+	return PNFS_NOT_ATTEMPTED;
+}
+
+static enum pnfs_try_status
+bl_write_pagelist(struct nfs_write_data *wdata,
+		  int sync)
+{
+	return PNFS_NOT_ATTEMPTED;
+}
+
+static void
+bl_free_layout_hdr(struct pnfs_layout_hdr *lo)
+{
+}
+
+static struct pnfs_layout_hdr *
+bl_alloc_layout_hdr(struct inode *inode, gfp_t gfp_flags)
+{
+	return NULL;
+}
+
+static void
+bl_free_lseg(struct pnfs_layout_segment *lseg)
+{
+}
+
+static struct pnfs_layout_segment *
+bl_alloc_lseg(struct pnfs_layout_hdr *lo,
+	      struct nfs4_layoutget_res *lgr, gfp_t gfp_flags)
+{
+	return NULL;
+}
+
+static void
+bl_encode_layoutcommit(struct pnfs_layout_hdr *lo, struct xdr_stream *xdr,
+		       const struct nfs4_layoutcommit_args *arg)
+{
+}
+
+static void
+bl_cleanup_layoutcommit(struct pnfs_layout_hdr *lo,
+			struct nfs4_layoutcommit_data *lcdata)
+{
+}
+
+static int
+bl_set_layoutdriver(struct nfs_server *server, const struct nfs_fh *fh)
+{
+	dprintk("%s enter\n", __func__);
+	return 0;
+}
+
+static int
+bl_clear_layoutdriver(struct nfs_server *server)
+{
+	dprintk("%s enter\n", __func__);
+	return 0;
+}
+
+static int
+bl_write_begin(struct pnfs_layout_segment *lseg, struct page *page, loff_t pos,
+	       unsigned count, struct pnfs_fsdata *fsdata)
+{
+	return 0;
+}
+
+static int
+bl_write_end(struct inode *inode, struct page *page, loff_t pos,
+	     unsigned count, unsigned copied, struct pnfs_layout_segment *lseg)
+{
+	return 0;
+}
+
+/* Return any memory allocated to fsdata->private, and take advantage
+ * of no page locks to mark pages noted in write_begin as needing
+ * initialization.
+ */
+static void
+bl_write_end_cleanup(struct file *filp, struct pnfs_fsdata *fsdata)
+{
+}
+
+static struct pnfs_layoutdriver_type blocklayout_type = {
+	.id = LAYOUT_BLOCK_VOLUME,
+	.name = "LAYOUT_BLOCK_VOLUME",
+	.read_pagelist			= bl_read_pagelist,
+	.write_pagelist			= bl_write_pagelist,
+	.write_begin			= bl_write_begin,
+	.write_end			= bl_write_end,
+	.write_end_cleanup		= bl_write_end_cleanup,
+	.alloc_layout_hdr		= bl_alloc_layout_hdr,
+	.free_layout_hdr		= bl_free_layout_hdr,
+	.alloc_lseg			= bl_alloc_lseg,
+	.free_lseg			= bl_free_lseg,
+	.encode_layoutcommit		= bl_encode_layoutcommit,
+	.cleanup_layoutcommit		= bl_cleanup_layoutcommit,
+	.set_layoutdriver		= bl_set_layoutdriver,
+	.clear_layoutdriver		= bl_clear_layoutdriver,
+	.pg_test                        = pnfs_generic_pg_test,
+};
+
+static int __init nfs4blocklayout_init(void)
+{
+	int ret;
+
+	dprintk("%s: NFSv4 Block Layout Driver Registering...\n", __func__);
+
+	ret = pnfs_register_layoutdriver(&blocklayout_type);
+	return ret;
+}
+
+static void __exit nfs4blocklayout_exit(void)
+{
+	dprintk("%s: NFSv4 Block Layout Driver Unregistering...\n",
+	       __func__);
+
+	pnfs_unregister_layoutdriver(&blocklayout_type);
+}
+
+module_init(nfs4blocklayout_init);
+module_exit(nfs4blocklayout_exit);
-- 
1.7.4.1
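
The effect of a stub that returns PNFS_NOT_ATTEMPTED can be sketched in userspace. This is a rough model under assumptions: driver_model, try_status_model and try_driver_read are hypothetical names, and the sketch assumes the generic layer treats "not attempted" as a signal to fall back to regular NFS I/O through the server.

```c
#include <stddef.h>

/* Hypothetical stand-in for enum pnfs_try_status. */
enum try_status_model {
	MODEL_ATTEMPTED = 0,
	MODEL_NOT_ATTEMPTED = 1,
};

/* Hypothetical stand-in for the read hook of a layout driver. */
struct driver_model {
	enum try_status_model (*read_pagelist)(void);
};

/* Stub hook, like bl_read_pagelist() in this patch. */
static enum try_status_model stub_read_pagelist(void)
{
	return MODEL_NOT_ATTEMPTED;
}

/*
 * Returns 1 if the driver handled the I/O, 0 if the caller has to fall
 * back to the regular path. A missing driver or a missing hook is
 * treated the same way as a hook that declines.
 */
static int try_driver_read(const struct driver_model *ld)
{
	if (ld == NULL || ld->read_pagelist == NULL)
		return 0;
	return ld->read_pagelist() == MODEL_ATTEMPTED;
}
```

With every hook stubbed like this, registering the driver should not change I/O behaviour, which lets the later patches in the series fill in real implementations one hook at a time.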



* [PATCH 09/34] pnfsblock: layout alloc and free
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (7 preceding siblings ...)
  2011-06-12 23:44 ` [PATCH 08/34] pnfsblock: blocklayout stub Jim Rees
@ 2011-06-12 23:44 ` Jim Rees
  2011-06-12 23:44 ` [PATCH 10/34] Add support for simple rpc pipefs Jim Rees
                   ` (24 subsequent siblings)
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Allocate the empty list-heads that will hold all the extent data
for the layout.

Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfs: move pnfs_layout_type inline in nfs_inode]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
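The commit message above describes allocating a driver-private layout whose
list heads start out empty. A minimal userspace sketch of that pattern follows.
Every name in it is a hypothetical stand-in, not kernel API: struct generic_hdr
plays the role of pnfs_layout_hdr, struct block_layout the role of
pnfs_block_layout, list_node_init() the role of INIT_LIST_HEAD(), and
hdr_to_block_layout() the role of BLK_LO2EXT().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the kernel's struct list_head. */
struct list_node {
	struct list_node *next, *prev;
};

/* An empty circular list points at itself. */
static void list_node_init(struct list_node *n)
{
	n->next = n;
	n->prev = n;
}

static int list_node_empty(const struct list_node *n)
{
	return n->next == n;
}

/* Hypothetical generic header that the core layer hands around. */
struct generic_hdr {
	int refcount;
};

/* Driver-private layout: the generic header is embedded, not pointed to. */
struct block_layout {
	struct generic_hdr hdr;
	struct list_node extents[2];	/* one list per extent class */
	unsigned long blocksize;	/* in 512-byte sectors */
};

/* Recover the enclosing structure from a pointer to its embedded header. */
#define hdr_to_block_layout(p) \
	((struct block_layout *)((char *)(p) - offsetof(struct block_layout, hdr)))

/* Allocate zeroed storage, start with empty lists, return the embedded header. */
static struct generic_hdr *block_layout_alloc(unsigned long blksize_bytes)
{
	struct block_layout *bl = calloc(1, sizeof(*bl));

	if (!bl)
		return NULL;
	list_node_init(&bl->extents[0]);
	list_node_init(&bl->extents[1]);
	bl->blocksize = blksize_bytes >> 9;	/* bytes to 512-byte sectors */
	return &bl->hdr;
}

/* Free through the enclosing structure, never through the header alone. */
static void block_layout_free(struct generic_hdr *hdr)
{
	free(hdr_to_block_layout(hdr));
}
```

The sketch frees the enclosing structure rather than the header because the
header is only a member of the allocation; it leaves out the spinlock and the
commit lists that the real structure also carries.
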
 fs/nfs/blocklayout/blocklayout.c |   39 +++++++++++++++-
 fs/nfs/blocklayout/blocklayout.h |   91 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 128 insertions(+), 2 deletions(-)
 create mode 100644 fs/nfs/blocklayout/blocklayout.h

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 2e0d41a..8218d54 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -32,7 +32,7 @@
 #include <linux/module.h>
 #include <linux/init.h>
 
-#include "../pnfs.h"
+#include "blocklayout.h"
 
 #define NFSDBG_FACILITY         NFSDBG_PNFS_LD
 
@@ -53,15 +53,50 @@ bl_write_pagelist(struct nfs_write_data *wdata,
 	return PNFS_NOT_ATTEMPTED;
 }
 
+/* STUB */
+static void
+release_extents(struct pnfs_block_layout *bl,
+		struct pnfs_layout_range *range)
+{
+	return;
+}
+
+/* STUB */
+static void
+release_inval_marks(struct pnfs_inval_markings *marks)
+{
+	return;
+}
+
 static void
 bl_free_layout_hdr(struct pnfs_layout_hdr *lo)
 {
+	struct pnfs_block_layout *bl = BLK_LO2EXT(lo);
+
+	dprintk("%s enter\n", __func__);
+	release_extents(bl, NULL);
+	release_inval_marks(&bl->bl_inval);
+	kfree(bl);
 }
 
 static struct pnfs_layout_hdr *
 bl_alloc_layout_hdr(struct inode *inode, gfp_t gfp_flags)
 {
-	return NULL;
+	struct pnfs_block_layout *bl;
+
+	dprintk("%s enter\n", __func__);
+	bl = kzalloc(sizeof(*bl), gfp_flags);
+	if (!bl)
+		return NULL;
+	spin_lock_init(&bl->bl_ext_lock);
+	INIT_LIST_HEAD(&bl->bl_extents[0]);
+	INIT_LIST_HEAD(&bl->bl_extents[1]);
+	INIT_LIST_HEAD(&bl->bl_commit);
+	INIT_LIST_HEAD(&bl->bl_committing);
+	bl->bl_count = 0;
+	bl->bl_blocksize = NFS_SERVER(inode)->pnfs_blksize >> 9;
+	INIT_INVAL_MARKS(&bl->bl_inval, bl->bl_blocksize);
+	return &bl->bl_layout;
 }
 
 static void
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
new file mode 100644
index 0000000..8ea82b8
--- /dev/null
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -0,0 +1,91 @@
+/*
+ *  linux/fs/nfs/blocklayout/blocklayout.h
+ *
+ *  Module for the NFSv4.1 pNFS block layout driver.
+ *
+ *  Copyright (c) 2006 The Regents of the University of Michigan.
+ *  All rights reserved.
+ *
+ *  Andy Adamson <andros@citi.umich.edu>
+ *  Fred Isaman <iisaman@umich.edu>
+ *
+ * permission is granted to use, copy, create derivative works and
+ * redistribute this software and such derivative works for any purpose,
+ * so long as the name of the university of michigan is not used in
+ * any advertising or publicity pertaining to the use or distribution
+ * of this software without specific, written prior authorization.  if
+ * the above copyright notice or any other identification of the
+ * university of michigan is included in any copy of any portion of
+ * this software, then the disclaimer below must also be included.
+ *
+ * this software is provided as is, without representation from the
+ * university of michigan as to its fitness for any purpose, and without
+ * warranty by the university of michigan of any kind, either express
+ * or implied, including without limitation the implied warranties of
+ * merchantability and fitness for a particular purpose.  the regents
+ * of the university of michigan shall not be liable for any damages,
+ * including special, indirect, incidental, or consequential damages,
+ * with respect to any claim arising out or in connection with the use
+ * of the software, even if it has been or is hereafter advised of the
+ * possibility of such damages.
+ */
+#ifndef FS_NFS_NFS4BLOCKLAYOUT_H
+#define FS_NFS_NFS4BLOCKLAYOUT_H
+
+#include <linux/nfs_fs.h>
+#include "../pnfs.h"
+
+enum exstate4 {
+	PNFS_BLOCK_READWRITE_DATA	= 0,
+	PNFS_BLOCK_READ_DATA		= 1,
+	PNFS_BLOCK_INVALID_DATA		= 2, /* mapped, but data is invalid */
+	PNFS_BLOCK_NONE_DATA		= 3  /* unmapped, it's a hole */
+};
+
+struct pnfs_inval_markings {
+	/* STUB */
+};
+
+/* sector_t fields are all in 512-byte sectors */
+struct pnfs_block_extent {
+	struct kref	be_refcnt;
+	struct list_head be_node;	/* link into lseg list */
+	struct nfs4_deviceid be_devid;  /* STUB - removable??? */
+	struct block_device *be_mdev;
+	sector_t	be_f_offset;	/* the starting offset in the file */
+	sector_t	be_length;	/* the size of the extent */
+	sector_t	be_v_offset;	/* the starting offset in the volume */
+	enum exstate4	be_state;	/* the state of this extent */
+	struct pnfs_inval_markings *be_inval; /* tracks INVAL->RW transition */
+};
+
+static inline void
+INIT_INVAL_MARKS(struct pnfs_inval_markings *marks, sector_t blocksize)
+{
+	/* STUB */
+}
+
+enum extentclass4 {
+	RW_EXTENT       = 0, /* READWRITE and INVAL */
+	RO_EXTENT       = 1, /* READ and NONE */
+	EXTENT_LISTS    = 2,
+};
+
+struct pnfs_block_layout {
+	struct pnfs_layout_hdr bl_layout;
+	struct pnfs_inval_markings bl_inval; /* tracks INVAL->RW transition */
+	spinlock_t		bl_ext_lock;   /* Protects list manipulation */
+	struct list_head	bl_extents[EXTENT_LISTS]; /* R and RW extents */
+	struct list_head	bl_commit;	/* Needs layout commit */
+	struct list_head	bl_committing;	/* Layout committing */
+	unsigned int		bl_count;	/* entries in bl_commit */
+	sector_t		bl_blocksize;  /* Server blocksize in sectors */
+};
+
+static inline struct pnfs_block_layout *
+BLK_LO2EXT(struct pnfs_layout_hdr *lo)
+{
+        return container_of(lo, struct pnfs_block_layout, bl_layout);
+}
+
+#endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
-- 
1.7.4.1



* [PATCH 10/34] Add support for simple rpc pipefs
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (8 preceding siblings ...)
  2011-06-12 23:44 ` [PATCH 09/34] pnfsblock: layout alloc and free Jim Rees
@ 2011-06-12 23:44 ` Jim Rees
  2011-06-12 23:44 ` [PATCH 11/34] pnfs-block: Add block device discovery pipe Jim Rees
                   ` (23 subsequent siblings)
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

pnfs-block: Add support for simple rpc pipefs
Signed-off-by: Eric Anderle <eanderle@umich.edu>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
move include lines out of include file
Signed-off-by: Jim Rees <rees@umich.edu>
[This patch does *not* break the header's independence]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
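The message format this patch introduces is a fixed header optionally followed
by a payload, with the payload located by stepping one header past the start.
A minimal userspace sketch of that layout follows. struct msg_hdr and
msg_alloc() are hypothetical stand-ins for struct pipefs_hdr and
pipefs_alloc_init_msg(); the sketch uses calloc() where the patch uses
kzalloc(), and the 16-bit limit check is a choice made here, not the PAGE_SIZE
check in the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical mirror of the header fields listed in the patch. */
struct msg_hdr {
	uint32_t msgid;
	uint8_t  type;
	uint8_t  flags;
	uint16_t totallen;	/* whole message, header included */
	uint32_t status;
};

/* The payload starts immediately after the header. */
#define payload_of(headerp) ((void *)((headerp) + 1))

/*
 * Allocate header plus payload in one block and copy the payload in.
 * The total is computed in size_t and range-checked before it is
 * narrowed into the 16-bit totallen field.
 */
static struct msg_hdr *msg_alloc(uint32_t msgid, uint8_t type,
				 const void *data, uint16_t datalen)
{
	size_t total = sizeof(struct msg_hdr) + datalen;
	struct msg_hdr *msg;

	if (total > UINT16_MAX)
		return NULL;
	msg = calloc(1, total);
	if (!msg)
		return NULL;
	msg->msgid = msgid;
	msg->type = type;
	msg->totallen = (uint16_t)total;
	if (data && datalen)
		memcpy(payload_of(msg), data, datalen);
	return msg;
}
```

A caller with a two-word payload would write msg_alloc(42, 1, body,
sizeof(body)) and read the words back through payload_of(msg).
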
 include/linux/sunrpc/simple_rpc_pipefs.h |  105 ++++++++
 net/sunrpc/simple_rpc_pipefs.c           |  423 ++++++++++++++++++++++++++++++
 2 files changed, 528 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/sunrpc/simple_rpc_pipefs.h
 create mode 100644 net/sunrpc/simple_rpc_pipefs.c

diff --git a/include/linux/sunrpc/simple_rpc_pipefs.h b/include/linux/sunrpc/simple_rpc_pipefs.h
new file mode 100644
index 0000000..f6a1227
--- /dev/null
+++ b/include/linux/sunrpc/simple_rpc_pipefs.h
@@ -0,0 +1,105 @@
+/*
+ *  Copyright (c) 2008 The Regents of the University of Michigan.
+ *  All rights reserved.
+ *
+ *  David M. Richter <richterd@citi.umich.edu>
+ *
+ *  Drawing on work done by Andy Adamson <andros@citi.umich.edu> and
+ *  Marius Eriksen <marius@monkey.org>.  Thanks for the help over the
+ *  years, guys.
+ *
+ *  Redistribution and use in source and binary forms, with or without
+ *  modification, are permitted provided that the following conditions
+ *  are met:
+ *
+ *  1. Redistributions of source code must retain the above copyright
+ *     notice, this list of conditions and the following disclaimer.
+ *  2. Redistributions in binary form must reproduce the above copyright
+ *     notice, this list of conditions and the following disclaimer in the
+ *     documentation and/or other materials provided with the distribution.
+ *  3. Neither the name of the University nor the names of its
+ *     contributors may be used to endorse or promote products derived
+ *     from this software without specific prior written permission.
+ *
+ *  THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED
+ *  WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+ *  MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ *  DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+ *  FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ *  CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ *  SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ *  BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ *  LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+ *  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ *  SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ *  With thanks to CITI's project sponsor and partner, IBM.
+ */
+
+#ifndef _SIMPLE_RPC_PIPEFS_H_
+#define _SIMPLE_RPC_PIPEFS_H_
+
+#include <linux/sunrpc/rpc_pipe_fs.h>
+
+#define payload_of(headerp)  ((void *)((headerp) + 1))
+
+/*
+ * struct pipefs_hdr -- the generic message format for simple_rpc_pipefs.
+ * Messages may simply be the header itself, although having an optional
+ * data payload follow the header allows much more flexibility.
+ *
+ * Messages are created using pipefs_alloc_init_msg() and
+ * pipefs_alloc_init_msg_padded(), both of which accept a pointer to an
+ * (optional) data payload.
+ *
+ * Given a struct pipefs_hdr *msg that has a struct foo payload, the data
+ * can be accessed using: struct foo *foop = payload_of(msg)
+ */
+struct pipefs_hdr {
+	u32 msgid;
+	u8  type;
+	u8  flags;
+	u16 totallen; /* length of entire message, including hdr itself */
+	u32 status;
+};
+
+/*
+ * struct pipefs_list -- a type of list used for tracking callers who've made an
+ * upcall and are blocked waiting for a reply.
+ *
+ * See pipefs_queue_upcall_waitreply() and pipefs_assign_upcall_reply().
+ */
+struct pipefs_list {
+	struct list_head list;
+	spinlock_t list_lock;
+};
+
+
+/* See net/sunrpc/simple_rpc_pipefs.c for more info on using these functions. */
+extern struct dentry *pipefs_mkpipe(const char *name,
+				    const struct rpc_pipe_ops *ops,
+				    int wait_for_open);
+extern void pipefs_closepipe(struct dentry *pipe);
+extern void pipefs_init_list(struct pipefs_list *list);
+extern struct pipefs_hdr *pipefs_alloc_init_msg(u32 msgid, u8 type, u8 flags,
+						void *data, u16 datalen);
+extern struct pipefs_hdr *pipefs_alloc_init_msg_padded(u32 msgid, u8 type,
+						       u8 flags, void *data,
+						       u16 datalen, u16 padlen);
+extern struct pipefs_hdr *pipefs_queue_upcall_waitreply(struct dentry *pipe,
+							struct pipefs_hdr *msg,
+							struct pipefs_list
+							*uplist, u8 upflags,
+							u32 timeout);
+extern int pipefs_queue_upcall_noreply(struct dentry *pipe,
+				       struct pipefs_hdr *msg, u8 upflags);
+extern int pipefs_assign_upcall_reply(struct pipefs_hdr *reply,
+				      struct pipefs_list *uplist);
+extern struct pipefs_hdr *pipefs_readmsg(struct file *filp,
+					 const char __user *src, size_t len);
+extern ssize_t pipefs_generic_upcall(struct file *filp,
+				     struct rpc_pipe_msg *rpcmsg,
+				     char __user *dst, size_t buflen);
+extern void pipefs_generic_destroy_msg(struct rpc_pipe_msg *rpcmsg);
+
+#endif /* _SIMPLE_RPC_PIPEFS_H_ */
diff --git a/net/sunrpc/simple_rpc_pipefs.c b/net/sunrpc/simple_rpc_pipefs.c
new file mode 100644
index 0000000..24af0a1
--- /dev/null
+++ b/net/sunrpc/simple_rpc_pipefs.c
@@ -0,0 +1,423 @@
+/*
+ *  net/sunrpc/simple_rpc_pipefs.c
+ *
+ *  Copyright (c) 2008 The Regents of the University of Michigan.
+ *  All rights reserved.
+ *
+ *  David M. Richter <richterd@citi.umich.edu>
+ *
+ *  Drawing on work done by Andy Adamson <andros@citi.umich.edu> and
+ *  Marius Eriksen <marius@monkey.org>.  Thanks for the help over the
+ *  years, guys.
+ *
+ *  Redistribution and use in source and binary forms, with or without
+ *  modification, are permitted provided that the following conditions
+ *  are met:
+ *
+ *  1. Redistributions of source code must retain the above copyright
+ *     notice, this list of conditions and the following disclaimer.
+ *  2. Redistributions in binary form must reproduce the above copyright
+ *     notice, this list of conditions and the following disclaimer in the
+ *     documentation and/or other materials provided with the distribution.
+ *  3. Neither the name of the University nor the names of its
+ *     contributors may be used to endorse or promote products derived
+ *     from this software without specific prior written permission.
+ *
+ *  THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED
+ *  WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+ *  MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ *  DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+ *  FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ *  CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ *  SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ *  BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ *  LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+ *  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ *  SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ *  With thanks to CITI's project sponsor and partner, IBM.
+ */
+
+#include <linux/mount.h>
+#include <linux/sunrpc/clnt.h>
+#include <linux/sunrpc/simple_rpc_pipefs.h>
+
+
+/*
+ * Make an rpc_pipefs pipe named @name at the root of the mounted rpc_pipefs
+ * filesystem.
+ *
+ * If @wait_for_open is non-zero and an upcall is later queued but the userland
+ * end of the pipe has not yet been opened, the upcall will remain queued until
+ * the pipe is opened; otherwise, the upcall queueing will return with -EPIPE.
+ */
+struct dentry *pipefs_mkpipe(const char *name, const struct rpc_pipe_ops *ops,
+			     int wait_for_open)
+{
+	struct dentry *dir, *pipe;
+	struct vfsmount *mnt;
+
+	mnt = rpc_get_mount();
+	if (IS_ERR(mnt)) {
+		pipe = ERR_CAST(mnt);
+		goto out;
+	}
+	dir = mnt->mnt_root;
+	if (!dir) {
+		pipe = ERR_PTR(-ENOENT);
+		goto out;
+	}
+	pipe = rpc_mkpipe(dir, name, NULL, ops,
+			  wait_for_open ? RPC_PIPE_WAIT_FOR_OPEN : 0);
+out:
+	return pipe;
+}
+EXPORT_SYMBOL(pipefs_mkpipe);
+
+/*
+ * Shut down a pipe made by pipefs_mkpipe().
+ * XXX: do we need to retain an extra reference on the mount?
+ */
+void pipefs_closepipe(struct dentry *pipe)
+{
+	rpc_unlink(pipe);
+	rpc_put_mount();
+}
+EXPORT_SYMBOL(pipefs_closepipe);
+
+/*
+ * Initialize a struct pipefs_list, which keeps track of callers that have
+ * made an upcall and are blocked awaiting a reply.
+ *
+ * See pipefs_queue_upcall_waitreply() and pipefs_assign_upcall_reply() for
+ * how to use it.
+ */
+inline void pipefs_init_list(struct pipefs_list *list)
+{
+	INIT_LIST_HEAD(&list->list);
+	spin_lock_init(&list->list_lock);
+}
+EXPORT_SYMBOL(pipefs_init_list);
+
+/*
+ * Alloc/init a generic pipefs message header and copy into its message body
+ * an arbitrary data payload.
+ *
+ * struct pipefs_hdr's are meant to serve as generic, general-purpose message
+ * headers for easy rpc_pipefs I/O.  When an upcall is made, the
+ * struct pipefs_hdr is assigned to a struct rpc_pipe_msg and delivered
+ * therein.  --And yes, the naming can seem a little confusing at first:
+ *
+ * When one thinks of an upcall "message", in simple_rpc_pipefs that's a
+ * struct pipefs_hdr (possibly with an attached message body).  A
+ * struct rpc_pipe_msg is actually only the -vehicle- by which the "real"
+ * message is delivered and processed.
+ */
+struct pipefs_hdr *pipefs_alloc_init_msg_padded(u32 msgid, u8 type, u8 flags,
+					   void *data, u16 datalen, u16 padlen)
+{
+	size_t totallen;
+	struct pipefs_hdr *msg = NULL;
+
+	totallen = sizeof(*msg) + datalen + padlen;
+	if (totallen > PAGE_SIZE) {
+		msg = ERR_PTR(-E2BIG);
+		goto out;
+	}
+
+	msg = kzalloc(totallen, GFP_KERNEL);
+	if (!msg) {
+		msg = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+
+	msg->msgid = msgid;
+	msg->type = type;
+	msg->flags = flags;
+	msg->totallen = totallen;
+	memcpy(payload_of(msg), data, datalen);
+out:
+	return msg;
+}
+EXPORT_SYMBOL(pipefs_alloc_init_msg_padded);
+
+/*
+ * See the description of pipefs_alloc_init_msg_padded().
+ */
+struct pipefs_hdr *pipefs_alloc_init_msg(u32 msgid, u8 type, u8 flags,
+				    void *data, u16 datalen)
+{
+	return pipefs_alloc_init_msg_padded(msgid, type, flags, data,
+					    datalen, 0);
+}
+EXPORT_SYMBOL(pipefs_alloc_init_msg);
+
+
+static void pipefs_init_rpcmsg(struct rpc_pipe_msg *rpcmsg,
+			       struct pipefs_hdr *msg, u8 upflags)
+{
+	memset(rpcmsg, 0, sizeof(*rpcmsg));
+	rpcmsg->data = msg;
+	rpcmsg->len = msg->totallen;
+	rpcmsg->flags = upflags;
+}
+
+static struct rpc_pipe_msg *pipefs_alloc_init_rpcmsg(struct pipefs_hdr *msg,
+						     u8 upflags)
+{
+	struct rpc_pipe_msg *rpcmsg;
+
+	rpcmsg = kmalloc(sizeof(*rpcmsg), GFP_KERNEL);
+	if (!rpcmsg)
+		return ERR_PTR(-ENOMEM);
+
+	pipefs_init_rpcmsg(rpcmsg, msg, upflags);
+	return rpcmsg;
+}
+
+
+/* represents an upcall that'll block and wait for a reply */
+struct pipefs_upcall {
+	u32 msgid;
+	struct rpc_pipe_msg rpcmsg;
+	struct list_head list;
+	wait_queue_head_t waitq;
+	struct pipefs_hdr *reply;
+};
+
+
+static void pipefs_init_upcall_waitreply(struct pipefs_upcall *upcall,
+					 struct pipefs_hdr *msg, u8 upflags)
+{
+	upcall->reply = NULL;
+	upcall->msgid = msg->msgid;
+	INIT_LIST_HEAD(&upcall->list);
+	init_waitqueue_head(&upcall->waitq);
+	pipefs_init_rpcmsg(&upcall->rpcmsg, msg, upflags);
+}
+
+static int __pipefs_queue_upcall_waitreply(struct dentry *pipe,
+					   struct pipefs_upcall *upcall,
+					   struct pipefs_list *uplist,
+					   u32 timeout)
+{
+	int err = 0;
+	DECLARE_WAITQUEUE(wq, current);
+
+	add_wait_queue(&upcall->waitq, &wq);
+	spin_lock(&uplist->list_lock);
+	list_add(&upcall->list, &uplist->list);
+	spin_unlock(&uplist->list_lock);
+
+	err = rpc_queue_upcall(pipe->d_inode, &upcall->rpcmsg);
+	if (err < 0)
+		goto out;
+
+	if (timeout) {
+		/* retval of 0 means timer expired */
+		err = schedule_timeout_uninterruptible(timeout);
+		if (err == 0 && upcall->reply == NULL)
+			err = -ETIMEDOUT;
+	} else {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		schedule();
+		__set_current_state(TASK_RUNNING);
+	}
+
+out:
+	spin_lock(&uplist->list_lock);
+	list_del_init(&upcall->list);
+	spin_unlock(&uplist->list_lock);
+	remove_wait_queue(&upcall->waitq, &wq);
+	return err;
+}
+
+/*
+ * Queue a pipefs msg for an upcall to userspace, place the calling thread
+ * on @uplist, and block the thread to wait for a reply.  If @timeout is
+ * nonzero, the thread will be blocked for at most @timeout jiffies.
+ *
+ * (To convert time units into jiffies, consider the functions
+ *  msecs_to_jiffies(), usecs_to_jiffies(), timeval_to_jiffies(), and
+ *  timespec_to_jiffies().)
+ *
+ * Once a reply is received by your downcall handler, call
+ * pipefs_assign_upcall_reply() with @uplist to find the corresponding upcall,
+ * assign the reply, and wake the waiting thread.
+ *
+ * This function's return value pointer may be an error and should be checked
+ * with IS_ERR() before attempting to access the reply message.
+ *
+ * Callers are responsible for freeing @msg, unless pipefs_generic_destroy_msg()
+ * is used as the ->destroy_msg() callback and the PIPEFS_AUTOFREE_UPCALL_MSG
+ * flag is set in @upflags.  See also rpc_pipe_fs.h.
+ */
+struct pipefs_hdr *pipefs_queue_upcall_waitreply(struct dentry *pipe,
+					    struct pipefs_hdr *msg,
+					    struct pipefs_list *uplist,
+					    u8 upflags, u32 timeout)
+{
+	int err = 0;
+	struct pipefs_upcall upcall;
+
+	pipefs_init_upcall_waitreply(&upcall, msg, upflags);
+	err = __pipefs_queue_upcall_waitreply(pipe, &upcall, uplist, timeout);
+	if (err < 0) {
+		kfree(upcall.reply);
+		upcall.reply = ERR_PTR(err);
+	}
+
+	return upcall.reply;
+}
+EXPORT_SYMBOL(pipefs_queue_upcall_waitreply);
+
+/*
+ * Queue a pipefs msg for an upcall to userspace and immediately return (i.e.,
+ * no reply is expected).
+ *
+ * Callers are responsible for freeing @msg, unless pipefs_generic_destroy_msg()
+ * is used as the ->destroy_msg() callback and the PIPEFS_AUTOFREE_UPCALL_MSG
+ * flag is set in @upflags.  See also rpc_pipe_fs.h.
+ */
+int pipefs_queue_upcall_noreply(struct dentry *pipe, struct pipefs_hdr *msg,
+				u8 upflags)
+{
+	int err = 0;
+	struct rpc_pipe_msg *rpcmsg;
+
+	upflags |= PIPEFS_AUTOFREE_RPCMSG;
+	rpcmsg = pipefs_alloc_init_rpcmsg(msg, upflags);
+	if (IS_ERR(rpcmsg)) {
+		err = PTR_ERR(rpcmsg);
+		goto out;
+	}
+	err = rpc_queue_upcall(pipe->d_inode, rpcmsg);
+out:
+	return err;
+}
+EXPORT_SYMBOL(pipefs_queue_upcall_noreply);
+
+
+static struct pipefs_upcall *pipefs_find_upcall_msgid(u32 msgid,
+						 struct pipefs_list *uplist)
+{
+	struct pipefs_upcall *upcall;
+
+	spin_lock(&uplist->list_lock);
+	list_for_each_entry(upcall, &uplist->list, list)
+		if (upcall->msgid == msgid)
+			goto out;
+	upcall = NULL;
+out:
+	spin_unlock(&uplist->list_lock);
+	return upcall;
+}
+
+/*
+ * In your rpc_pipe_ops->downcall() handler, once you've read in a downcall
+ * message and have determined that it is a reply to a waiting upcall,
+ * you can use this function to find the appropriate upcall, assign the result,
+ * and wake the upcall thread.
+ *
+ * The reply message must have the same msgid as the original upcall message's.
+ *
+ * See also pipefs_queue_upcall_waitreply() and pipefs_readmsg().
+ */
+int pipefs_assign_upcall_reply(struct pipefs_hdr *reply,
+			       struct pipefs_list *uplist)
+{
+	int err = 0;
+	struct pipefs_upcall *upcall;
+
+	upcall = pipefs_find_upcall_msgid(reply->msgid, uplist);
+	if (!upcall) {
+		printk(KERN_ERR "%s: ERROR: have reply but no matching upcall "
+			"for msgid %u\n", __func__, reply->msgid);
+		err = -ENOENT;
+		goto out;
+	}
+	upcall->reply = reply;
+	wake_up(&upcall->waitq);
+out:
+	return err;
+}
+EXPORT_SYMBOL(pipefs_assign_upcall_reply);
+
+/*
+ * Generic method to read-in and return a newly-allocated message which begins
+ * with a struct pipefs_hdr.
+ */
+struct pipefs_hdr *pipefs_readmsg(struct file *filp, const char __user *src,
+			     size_t len)
+{
+	int err = 0, hdrsize;
+	struct pipefs_hdr *msg = NULL;
+
+	hdrsize = sizeof(*msg);
+	if (len < hdrsize) {
+		printk(KERN_ERR "%s: ERROR: header is too short (%d vs %d)\n",
+		       __func__, (int) len, hdrsize);
+		err = -EINVAL;
+		goto out;
+	}
+
+	msg = kzalloc(len, GFP_KERNEL);
+	if (!msg) {
+		err = -ENOMEM;
+		goto out;
+	}
+	if (copy_from_user(msg, src, len))
+		err = -EFAULT;
+out:
+	if (err) {
+		kfree(msg);
+		msg = ERR_PTR(err);
+	}
+	return msg;
+}
+EXPORT_SYMBOL(pipefs_readmsg);
+
+/*
+ * Generic rpc_pipe_ops->upcall() handler implementation.
+ *
+ * Don't call this directly: to make an upcall, use
+ * pipefs_queue_upcall_waitreply() or pipefs_queue_upcall_noreply().
+ */
+ssize_t pipefs_generic_upcall(struct file *filp, struct rpc_pipe_msg *rpcmsg,
+			      char __user *dst, size_t buflen)
+{
+	char *data;
+	ssize_t len, left;
+
+	data = (char *)rpcmsg->data + rpcmsg->copied;
+	len = rpcmsg->len - rpcmsg->copied;
+	if (len > buflen)
+		len = buflen;
+
+	left = copy_to_user(dst, data, len);
+	if (left == len) {
+		rpcmsg->errno = -EFAULT;
+		return -EFAULT;
+	}
+
+	len -= left;
+	rpcmsg->copied += len;
+	rpcmsg->errno = 0;
+	return len;
+}
+EXPORT_SYMBOL(pipefs_generic_upcall);
+
+/*
+ * Generic rpc_pipe_ops->destroy_msg() handler implementation.
+ *
+ * Items are only freed if @rpcmsg->flags has been set appropriately.
+ * See pipefs_queue_upcall_noreply() and rpc_pipe_fs.h.
+ */
+void pipefs_generic_destroy_msg(struct rpc_pipe_msg *rpcmsg)
+{
+	if (rpcmsg->flags & PIPEFS_AUTOFREE_UPCALL_MSG)
+		kfree(rpcmsg->data);
+	if (rpcmsg->flags & PIPEFS_AUTOFREE_RPCMSG)
+		kfree(rpcmsg);
+}
+EXPORT_SYMBOL(pipefs_generic_destroy_msg);
-- 
1.7.4.1



* [PATCH 11/34] pnfs-block: Add block device discovery pipe
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (9 preceding siblings ...)
  2011-06-12 23:44 ` [PATCH 10/34] Add support for simple rpc pipefs Jim Rees
@ 2011-06-12 23:44 ` Jim Rees
  2011-06-12 23:44 ` [PATCH 12/34] pnfsblock: basic extent code Jim Rees
                   ` (22 subsequent siblings)
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

Signed-off-by: Eric Anderle <eanderle@umich.edu>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
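The downcall handler in this patch reads a message from userspace and hands it
to the waiter whose msgid matches. A minimal userspace sketch of that matching
step follows. struct waiter, struct reply and assign_reply() are hypothetical
stand-ins for struct pipefs_upcall, struct pipefs_hdr and
pipefs_assign_upcall_reply(); the sketch uses a plain singly linked list and
leaves out the spinlock and the wake-up that the kernel code performs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reply message; only the id matters for matching. */
struct reply {
	uint32_t msgid;
	int status;
};

/* Hypothetical record for one blocked caller awaiting a reply. */
struct waiter {
	uint32_t msgid;
	struct reply *reply;	/* NULL until a reply is assigned */
	struct waiter *next;
};

/*
 * Find the waiter whose msgid equals the reply's and attach the reply.
 * Returns 0 on success, -1 if no waiter matches; in that case the
 * caller still owns the reply and is expected to free it.
 */
static int assign_reply(struct waiter *head, struct reply *r)
{
	struct waiter *w;

	for (w = head; w; w = w->next) {
		if (w->msgid == r->msgid) {
			w->reply = r;
			return 0;
		}
	}
	return -1;
}
```

The unmatched case corresponds to the error path in bl_pipe_downcall(), which
frees the message itself when no upcall is waiting for it.
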
 fs/nfs/blocklayout/Makefile                      |    2 +-
 fs/nfs/blocklayout/block-device-discovery-pipe.c |   66 ++++++++++++++++++++++
 fs/nfs/blocklayout/blocklayout.c                 |    3 +
 fs/nfs/blocklayout/blocklayout.h                 |   14 +++++
 4 files changed, 84 insertions(+), 1 deletions(-)
 create mode 100644 fs/nfs/blocklayout/block-device-discovery-pipe.c

diff --git a/fs/nfs/blocklayout/Makefile b/fs/nfs/blocklayout/Makefile
index 6bf49cd..d2bcd81 100644
--- a/fs/nfs/blocklayout/Makefile
+++ b/fs/nfs/blocklayout/Makefile
@@ -2,4 +2,4 @@
 # Makefile for the pNFS block layout driver kernel module
 #
 obj-$(CONFIG_PNFS_BLOCK) += blocklayoutdriver.o
-blocklayoutdriver-objs := blocklayout.o
+blocklayoutdriver-objs := blocklayout.o block-device-discovery-pipe.o
diff --git a/fs/nfs/blocklayout/block-device-discovery-pipe.c b/fs/nfs/blocklayout/block-device-discovery-pipe.c
new file mode 100644
index 0000000..e4c199f
--- /dev/null
+++ b/fs/nfs/blocklayout/block-device-discovery-pipe.c
@@ -0,0 +1,66 @@
+#include <linux/module.h>
+#include <linux/uaccess.h>
+#include <linux/proc_fs.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/ctype.h>
+#include <linux/sched.h>
+#include "blocklayout.h"
+
+#define NFSDBG_FACILITY NFSDBG_PNFS_LD
+
+struct pipefs_list bl_device_list;
+struct dentry *bl_device_pipe;
+
+ssize_t bl_pipe_downcall(struct file *filp, const char __user *src, size_t len)
+{
+	int err;
+	struct pipefs_hdr *msg;
+
+	dprintk("Entering %s...\n", __func__);
+
+	msg = pipefs_readmsg(filp, src, len);
+	if (IS_ERR(msg)) {
+		dprintk("ERROR: unable to read pipefs message.\n");
+		return PTR_ERR(msg);
+	}
+
+	/* now assign the result, which wakes the blocked thread */
+	err = pipefs_assign_upcall_reply(msg, &bl_device_list);
+	if (err) {
+		dprintk("ERROR: failed to assign upcall with id %u\n",
+			msg->msgid);
+		kfree(msg);
+	}
+	return len;
+}
+
+static const struct rpc_pipe_ops bl_pipe_ops = {
+	.upcall         = pipefs_generic_upcall,
+	.downcall       = bl_pipe_downcall,
+	.destroy_msg    = pipefs_generic_destroy_msg,
+};
+
+int bl_pipe_init(void)
+{
+	dprintk("%s: block_device pipefs registering...\n", __func__);
+	bl_device_pipe = pipefs_mkpipe("bl_device_pipe", &bl_pipe_ops, 1);
+	if (IS_ERR(bl_device_pipe))
+		dprintk("ERROR, unable to make block_device pipe\n");
+
+	if (!bl_device_pipe)
+		dprintk("bl_device_pipe is NULL!\n");
+	else
+		dprintk("bl_device_pipe created!\n");
+	pipefs_init_list(&bl_device_list);
+	return 0;
+}
+
+void bl_pipe_exit(void)
+{
+	dprintk("%s: block_device pipefs unregistering...\n", __func__);
+	if (IS_ERR(bl_device_pipe))
+		return;
+	pipefs_closepipe(bl_device_pipe);
+	return;
+}
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 8218d54..9932519 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -186,6 +186,8 @@ static int __init nfs4blocklayout_init(void)
 	dprintk("%s: NFSv4 Block Layout Driver Registering...\n", __func__);
 
 	ret = pnfs_register_layoutdriver(&blocklayout_type);
+	if (!ret)
+		bl_pipe_init();
 	return ret;
 }
 
@@ -195,6 +197,7 @@ static void __exit nfs4blocklayout_exit(void)
 	       __func__);
 
 	pnfs_unregister_layoutdriver(&blocklayout_type);
+	bl_pipe_exit();
 }
 
 module_init(nfs4blocklayout_init);
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 8ea82b8..2a78462 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -88,4 +88,18 @@ BLK_LO2EXT(struct pnfs_layout_hdr *lo)
         return container_of(lo, struct pnfs_block_layout, bl_layout);
 }
 
+#include <linux/sunrpc/simple_rpc_pipefs.h>
+
+extern struct pipefs_list bl_device_list;
+extern struct dentry *bl_device_pipe;
+
+int bl_pipe_init(void);
+void bl_pipe_exit(void);
+
+#define BL_DEVICE_UMOUNT               0x0 /* Umount--delete devices */
+#define BL_DEVICE_MOUNT                0x1 /* Mount--create devices */
+#define BL_DEVICE_REQUEST_INIT         0x0 /* Start request */
+#define BL_DEVICE_REQUEST_PROC         0x1 /* User level process succeeds */
+#define BL_DEVICE_REQUEST_ERR          0x2 /* User level process fails */
+
 #endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
-- 
1.7.4.1



* [PATCH 12/34] pnfsblock: basic extent code
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (10 preceding siblings ...)
  2011-06-12 23:44 ` [PATCH 11/34] pnfs-block: Add block device discovery pipe Jim Rees
@ 2011-06-12 23:44 ` Jim Rees
  2011-06-12 23:44 ` [PATCH 13/34] pnfsblock: add device operations Jim Rees
                   ` (21 subsequent siblings)
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Adds structures and basic create/delete code for extents.

Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Zhang Jingwang <Jingwang.Zhang@emc.com>
---
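The delete path added here drains each extent list and drops the list's
reference on every entry, so an extent that is still referenced elsewhere
survives the drain. A minimal userspace sketch of that idea follows. struct
extent, extent_new(), extent_put() and release_all() are hypothetical stand-ins
for struct pnfs_block_extent, extent creation, put_extent() and
release_extents(); a plain integer replaces the kref, a singly linked list
replaces list_head, and locking is omitted.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted extent on a singly linked list. */
struct extent {
	int refcnt;
	struct extent *next;
};

static int extents_freed;	/* demonstration counter only */

/* Create an extent holding one reference and link it ahead of @next. */
static struct extent *extent_new(struct extent *next)
{
	struct extent *e = malloc(sizeof(*e));

	if (!e)
		return NULL;
	e->refcnt = 1;
	e->next = next;
	return e;
}

/* Drop one reference; free the extent when the last one goes away. */
static void extent_put(struct extent *e)
{
	if (--e->refcnt == 0) {
		free(e);
		extents_freed++;
	}
}

/* Unlink every extent from every list, then drop the list's reference. */
static void release_all(struct extent *lists[], int nlists)
{
	int i;

	for (i = 0; i < nlists; i++) {
		while (lists[i]) {
			struct extent *e = lists[i];

			lists[i] = e->next;	/* unlink before the put */
			e->next = NULL;
			extent_put(e);
		}
	}
}
```

Unlinking before the put matters because the put may free the entry, after
which its link field can no longer be read.
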
 fs/nfs/blocklayout/Makefile      |    2 +-
 fs/nfs/blocklayout/blocklayout.c |   17 ++++++-
 fs/nfs/blocklayout/blocklayout.h |    1 +
 fs/nfs/blocklayout/extents.c     |   97 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 114 insertions(+), 3 deletions(-)
 create mode 100644 fs/nfs/blocklayout/extents.c

diff --git a/fs/nfs/blocklayout/Makefile b/fs/nfs/blocklayout/Makefile
index d2bcd81..af39d19 100644
--- a/fs/nfs/blocklayout/Makefile
+++ b/fs/nfs/blocklayout/Makefile
@@ -2,4 +2,4 @@
 # Makefile for the pNFS block layout driver kernel module
 #
 obj-$(CONFIG_PNFS_BLOCK) += blocklayoutdriver.o
-blocklayoutdriver-objs := blocklayout.o block-device-discovery-pipe.o
+blocklayoutdriver-objs := blocklayout.o block-device-discovery-pipe.o extents.o
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 9932519..a245e73 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -53,12 +53,25 @@ bl_write_pagelist(struct nfs_write_data *wdata,
 	return PNFS_NOT_ATTEMPTED;
 }
 
-/* STUB */
+/* FIXME - range ignored */
 static void
 release_extents(struct pnfs_block_layout *bl,
 		struct pnfs_layout_range *range)
 {
-	return;
+	int i;
+	struct pnfs_block_extent *be;
+
+	spin_lock(&bl->bl_ext_lock);
+	for (i = 0; i < EXTENT_LISTS; i++) {
+		while (!list_empty(&bl->bl_extents[i])) {
+			be = list_first_entry(&bl->bl_extents[i],
+					      struct pnfs_block_extent,
+					      be_node);
+			list_del(&be->be_node);
+			put_extent(be);
+		}
+	}
+	spin_unlock(&bl->bl_ext_lock);
 }
 
 /* STUB */
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 2a78462..ef1e64d 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -102,4 +102,5 @@ void bl_pipe_exit(void);
 #define BL_DEVICE_REQUEST_PROC         0x1 /* User level process succeeds */
 #define BL_DEVICE_REQUEST_ERR          0x2 /* User level process fails */
 
+void put_extent(struct pnfs_block_extent *be);
 #endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
new file mode 100644
index 0000000..1283fa9
--- /dev/null
+++ b/fs/nfs/blocklayout/extents.c
@@ -0,0 +1,97 @@
+/*
+ *  linux/fs/nfs/blocklayout/extents.c
+ *
+ *  Module for the NFSv4.1 pNFS block layout driver.
+ *
+ *  Copyright (c) 2006 The Regents of the University of Michigan.
+ *  All rights reserved.
+ *
+ *  Andy Adamson <andros@citi.umich.edu>
+ *  Fred Isaman <iisaman@umich.edu>
+ *
+ * permission is granted to use, copy, create derivative works and
+ * redistribute this software and such derivative works for any purpose,
+ * so long as the name of the university of michigan is not used in
+ * any advertising or publicity pertaining to the use or distribution
+ * of this software without specific, written prior authorization.  if
+ * the above copyright notice or any other identification of the
+ * university of michigan is included in any copy of any portion of
+ * this software, then the disclaimer below must also be included.
+ *
+ * this software is provided as is, without representation from the
+ * university of michigan as to its fitness for any purpose, and without
+ * warranty by the university of michigan of any kind, either express
+ * or implied, including without limitation the implied warranties of
+ * merchantability and fitness for a particular purpose.  the regents
+ * of the university of michigan shall not be liable for any damages,
+ * including special, indirect, incidental, or consequential damages,
+ * with respect to any claim arising out or in connection with the use
+ * of the software, even if it has been or is hereafter advised of the
+ * possibility of such damages.
+ */
+
+#include "blocklayout.h"
+#define NFSDBG_FACILITY         NFSDBG_PNFS_LD
+
+static void print_bl_extent(struct pnfs_block_extent *be)
+{
+	dprintk("PRINT EXTENT extent %p\n", be);
+	if (be) {
+		dprintk("        be_f_offset %llu\n", (u64)be->be_f_offset);
+		dprintk("        be_length   %llu\n", (u64)be->be_length);
+		dprintk("        be_v_offset %llu\n", (u64)be->be_v_offset);
+		dprintk("        be_state    %d\n", be->be_state);
+	}
+}
+
+static void
+destroy_extent(struct kref *kref)
+{
+	struct pnfs_block_extent *be;
+
+	be = container_of(kref, struct pnfs_block_extent, be_refcnt);
+	dprintk("%s be=%p\n", __func__, be);
+	kfree(be);
+}
+
+void
+put_extent(struct pnfs_block_extent *be)
+{
+	if (be) {
+		dprintk("%s enter %p (%i)\n", __func__, be,
+			atomic_read(&be->be_refcnt.refcount));
+		kref_put(&be->be_refcnt, destroy_extent);
+	}
+}
+
+struct pnfs_block_extent *alloc_extent(void)
+{
+	struct pnfs_block_extent *be;
+
+	be = kmalloc(sizeof(struct pnfs_block_extent), GFP_KERNEL);
+	if (!be)
+		return NULL;
+	INIT_LIST_HEAD(&be->be_node);
+	kref_init(&be->be_refcnt);
+	be->be_inval = NULL;
+	return be;
+}
+
+struct pnfs_block_extent *
+get_extent(struct pnfs_block_extent *be)
+{
+	if (be)
+		kref_get(&be->be_refcnt);
+	return be;
+}
+
+void print_elist(struct list_head *list)
+{
+	struct pnfs_block_extent *be;
+	dprintk("****************\n");
+	dprintk("Extent list looks like:\n");
+	list_for_each_entry(be, list, be_node) {
+		print_bl_extent(be);
+	}
+	dprintk("****************\n");
+}
-- 
1.7.4.1



* [PATCH 13/34] pnfsblock: add device operations
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (11 preceding siblings ...)
  2011-06-12 23:44 ` [PATCH 12/34] pnfsblock: basic extent code Jim Rees
@ 2011-06-12 23:44 ` Jim Rees
  2011-06-12 23:44 ` [PATCH 14/34] pnfsblock: remove " Jim Rees
                   ` (20 subsequent siblings)
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
 fs/nfs/blocklayout/Makefile         |    2 +-
 fs/nfs/blocklayout/blocklayout.h    |   15 ++++
 fs/nfs/blocklayout/blocklayoutdev.c |  151 +++++++++++++++++++++++++++++++++++
 3 files changed, 167 insertions(+), 1 deletions(-)
 create mode 100644 fs/nfs/blocklayout/blocklayoutdev.c
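
Reviewer note: blk_overflow() in this patch guards XDR decoding by checking that nbytes, rounded up to 32-bit words, still fit before the end of the buffer. A minimal userspace sketch of the same check follows. The demo_* names are hypothetical, and the sketch compares against the remaining word count rather than forming p + quadlen, which is a substitution made here to avoid computing a pointer past the buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of 32-bit XDR words needed to hold nbytes (rounds up). */
static size_t demo_xdr_quadlen(size_t nbytes)
{
	return nbytes / 4 + (nbytes % 4 != 0);
}

/*
 * Return p if nbytes of XDR data starting at p fit before end,
 * NULL otherwise.  Both pointers must refer to the same buffer.
 */
static uint32_t *demo_blk_overflow(uint32_t *p, uint32_t *end, size_t nbytes)
{
	if (p > end)
		return NULL;
	if (demo_xdr_quadlen(nbytes) > (size_t)(end - p))
		return NULL;
	return p;
}
```

A decoder would call this before each read and bail out on NULL, so a short or corrupt reply cannot walk it off the end of the buffer.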

diff --git a/fs/nfs/blocklayout/Makefile b/fs/nfs/blocklayout/Makefile
index af39d19..bd69aad 100644
--- a/fs/nfs/blocklayout/Makefile
+++ b/fs/nfs/blocklayout/Makefile
@@ -2,4 +2,4 @@
 # Makefile for the pNFS block layout driver kernel module
 #
 obj-$(CONFIG_PNFS_BLOCK) += blocklayoutdriver.o
-blocklayoutdriver-objs := blocklayout.o block-device-discovery-pipe.o extents.o
+blocklayoutdriver-objs := blocklayout.o block-device-discovery-pipe.o extents.o blocklayoutdev.o
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index ef1e64d..6c01d8c 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -35,6 +35,12 @@
 #include <linux/nfs_fs.h>
 #include "../pnfs.h"
 
+struct pnfs_block_dev {
+	struct list_head		bm_node;
+	struct nfs4_deviceid		bm_mdevid;    /* associated devid */
+	struct block_device		*bm_mdev;     /* meta device itself */
+};
+
 enum exstate4 {
 	PNFS_BLOCK_READWRITE_DATA	= 0,
 	PNFS_BLOCK_READ_DATA		= 1,
@@ -88,6 +94,15 @@ BLK_LO2EXT(struct pnfs_layout_hdr *lo)
         return container_of(lo, struct pnfs_block_layout, bl_layout);
 }
 
+/* blocklayoutdev.c */
+struct block_device *nfs4_blkdev_get(dev_t dev);
+int nfs4_blkdev_put(struct block_device *bdev);
+struct pnfs_block_dev *nfs4_blk_decode_device(struct nfs_server *server,
+						struct pnfs_device *dev,
+						struct list_head *sdlist);
+int nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
+				struct nfs4_layoutget_res *lgr, gfp_t gfp_flags);
+
 #include <linux/sunrpc/simple_rpc_pipefs.h>
 
 extern struct pipefs_list bl_device_list;
diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
new file mode 100644
index 0000000..9a65a66
--- /dev/null
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -0,0 +1,151 @@
+/*
+ *  linux/fs/nfs/blocklayout/blocklayoutdev.c
+ *
+ *  Device operations for the pnfs nfs4 block layout driver.
+ *
+ *  Copyright (c) 2006 The Regents of the University of Michigan.
+ *  All rights reserved.
+ *
+ *  Andy Adamson <andros@citi.umich.edu>
+ *  Fred Isaman <iisaman@umich.edu>
+ *
+ * permission is granted to use, copy, create derivative works and
+ * redistribute this software and such derivative works for any purpose,
+ * so long as the name of the university of michigan is not used in
+ * any advertising or publicity pertaining to the use or distribution
+ * of this software without specific, written prior authorization.  if
+ * the above copyright notice or any other identification of the
+ * university of michigan is included in any copy of any portion of
+ * this software, then the disclaimer below must also be included.
+ *
+ * this software is provided as is, without representation from the
+ * university of michigan as to its fitness for any purpose, and without
+ * warranty by the university of michigan of any kind, either express
+ * or implied, including without limitation the implied warranties of
+ * merchantability and fitness for a particular purpose.  the regents
+ * of the university of michigan shall not be liable for any damages,
+ * including special, indirect, incidental, or consequential damages,
+ * with respect to any claim arising out or in connection with the use
+ * of the software, even if it has been or is hereafter advised of the
+ * possibility of such damages.
+ */
+#include <linux/module.h>
+#include <linux/buffer_head.h> /* __bread */
+
+#include <linux/genhd.h>
+#include <linux/blkdev.h>
+#include <linux/hash.h>
+
+#include "blocklayout.h"
+
+#define NFSDBG_FACILITY         NFSDBG_PNFS_LD
+
+uint32_t *blk_overflow(uint32_t *p, uint32_t *end, size_t nbytes)
+{
+	uint32_t *q = p + XDR_QUADLEN(nbytes);
+	if (unlikely(q > end || q < p))
+		return NULL;
+	return p;
+}
+EXPORT_SYMBOL(blk_overflow);
+
+/* Open a block_device by device number. */
+struct block_device *nfs4_blkdev_get(dev_t dev)
+{
+	struct block_device *bd;
+
+	dprintk("%s enter\n", __func__);
+	bd = blkdev_get_by_dev(dev, FMODE_READ, NULL);
+	if (IS_ERR(bd))
+		goto fail;
+	return bd;
+fail:
+	dprintk("%s failed to open device : %ld\n",
+			__func__, PTR_ERR(bd));
+	return NULL;
+}
+
+/*
+ * Release the block device
+ */
+int nfs4_blkdev_put(struct block_device *bdev)
+{
+	dprintk("%s for device %d:%d\n", __func__, MAJOR(bdev->bd_dev),
+			MINOR(bdev->bd_dev));
+	return blkdev_put(bdev, FMODE_READ);
+}
+
+/* Decodes pnfs_block_deviceaddr4 (draft-8) which is XDR encoded
+ * in dev->dev_addr_buf.
+ */
+struct pnfs_block_dev *
+nfs4_blk_decode_device(struct nfs_server *server,
+		       struct pnfs_device *dev,
+		       struct list_head *sdlist)
+{
+	struct pnfs_block_dev *rv = NULL;
+	struct block_device *bd = NULL;
+	struct pipefs_hdr *msg = NULL, *reply = NULL;
+	uint32_t major, minor;
+
+	dprintk("%s enter\n", __func__);
+
+	if (IS_ERR(bl_device_pipe))
+		return NULL;
+	dprintk("%s CREATING PIPEFS MESSAGE\n", __func__);
+	dprintk("%s: deviceid: %s, mincount: %d\n", __func__, dev->dev_id.data,
+		dev->mincount);
+	msg = pipefs_alloc_init_msg(0, BL_DEVICE_MOUNT, 0, dev->area,
+				    dev->mincount);
+	if (IS_ERR(msg)) {
+		dprintk("ERROR: couldn't make pipefs message.\n");
+		goto out_err;
+	}
+	msg->msgid = hash_ptr(&msg, sizeof(msg->msgid) * 8);
+	msg->status = BL_DEVICE_REQUEST_INIT;
+
+	dprintk("%s CALLING USERSPACE DAEMON\n", __func__);
+	reply = pipefs_queue_upcall_waitreply(bl_device_pipe, msg,
+					      &bl_device_list, 0, 0);
+
+	if (IS_ERR(reply)) {
+		dprintk("ERROR: upcall_waitreply failed\n");
+		goto out_err;
+	}
+	if (reply->status != BL_DEVICE_REQUEST_PROC) {
+		dprintk("%s userspace daemon failed to process request\n",
+			__func__);
+		goto out_err;
+	}
+	memcpy(&major, (uint32_t *)(payload_of(reply)), sizeof(uint32_t));
+	memcpy(&minor, (uint32_t *)(payload_of(reply) + sizeof(uint32_t)),
+		sizeof(uint32_t));
+	bd = nfs4_blkdev_get(MKDEV(major, minor));
+	if (!bd) {	/* nfs4_blkdev_get() returns NULL on failure */
+		dprintk("%s failed to open device %u:%u\n",
+			__func__, major, minor);
+		goto out_err;
+	}
+
+	rv = kzalloc(sizeof(*rv), GFP_KERNEL);
+	if (!rv)
+		goto out_err;
+
+	rv->bm_mdev = bd;
+	memcpy(&rv->bm_mdevid, &dev->dev_id, sizeof(struct nfs4_deviceid));
+	dprintk("%s Created device %s with bd_block_size %u\n",
+		__func__,
+		bd->bd_disk->disk_name,
+		bd->bd_block_size);
+	kfree(reply);
+	kfree(msg);
+	return rv;
+
+out_err:
+	kfree(rv);
+	if (!IS_ERR(reply))
+		kfree(reply);
+	if (!IS_ERR(msg))
+		kfree(msg);
+	return NULL;
+}
-- 
1.7.4.1



* [PATCH 14/34] pnfsblock: remove device operations
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (12 preceding siblings ...)
  2011-06-12 23:44 ` [PATCH 13/34] pnfsblock: add device operations Jim Rees
@ 2011-06-12 23:44 ` Jim Rees
  2011-06-12 23:44 ` [PATCH 15/34] pnfsblock: lseg alloc and free Jim Rees
                   ` (19 subsequent siblings)
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
 fs/nfs/blocklayout/Makefile        |    2 +-
 fs/nfs/blocklayout/blocklayout.h   |    2 +
 fs/nfs/blocklayout/blocklayoutdm.c |  120 ++++++++++++++++++++++++++++++++++++
 3 files changed, 123 insertions(+), 1 deletions(-)
 create mode 100644 fs/nfs/blocklayout/blocklayoutdm.c
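
Reviewer note: dev_remove() passes the device number to the userspace daemon as two host-order uint32_t values, major first, packed into eight bytes. The pack and unpack steps can be sketched as below. The demo_* names are hypothetical, and the sketch uses an unsigned char buffer instead of arithmetic on a void pointer, which is a GNU extension.

```c
#include <stdint.h>
#include <string.h>

/* Pack major then minor, each as a host-order uint32_t, into 8 bytes. */
static void demo_pack_dev(unsigned char out[8], uint32_t major, uint32_t minor)
{
	memcpy(out, &major, sizeof(major));
	memcpy(out + sizeof(major), &minor, sizeof(minor));
}

/* Reverse of demo_pack_dev(): read major, then minor. */
static void demo_unpack_dev(const unsigned char in[8],
			    uint32_t *major, uint32_t *minor)
{
	memcpy(major, in, sizeof(*major));
	memcpy(minor, in + sizeof(*major), sizeof(*minor));
}
```

Because both sides run on the same host, no byte swapping is needed in this sketch; a wire format shared between machines would have to fix the byte order.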

diff --git a/fs/nfs/blocklayout/Makefile b/fs/nfs/blocklayout/Makefile
index bd69aad..bdbf180 100644
--- a/fs/nfs/blocklayout/Makefile
+++ b/fs/nfs/blocklayout/Makefile
@@ -2,4 +2,4 @@
 # Makefile for the pNFS block layout driver kernel module
 #
 obj-$(CONFIG_PNFS_BLOCK) += blocklayoutdriver.o
-blocklayoutdriver-objs := blocklayout.o block-device-discovery-pipe.o extents.o blocklayoutdev.o
+blocklayoutdriver-objs := blocklayout.o block-device-discovery-pipe.o extents.o blocklayoutdev.o blocklayoutdm.o
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 6c01d8c..e5ee11c 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -102,6 +102,8 @@ struct pnfs_block_dev *nfs4_blk_decode_device(struct nfs_server *server,
 						struct list_head *sdlist);
 int nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
 				struct nfs4_layoutget_res *lgr, gfp_t gfp_flags);
+/* blocklayoutdm.c */
+void free_block_dev(struct pnfs_block_dev *bdev);
 
 #include <linux/sunrpc/simple_rpc_pipefs.h>
 
diff --git a/fs/nfs/blocklayout/blocklayoutdm.c b/fs/nfs/blocklayout/blocklayoutdm.c
new file mode 100644
index 0000000..097dd05
--- /dev/null
+++ b/fs/nfs/blocklayout/blocklayoutdm.c
@@ -0,0 +1,120 @@
+/*
+ *  linux/fs/nfs/blocklayout/blocklayoutdm.c
+ *
+ *  Module for the NFSv4.1 pNFS block layout driver.
+ *
+ *  Copyright (c) 2007 The Regents of the University of Michigan.
+ *  All rights reserved.
+ *
+ *  Fred Isaman <iisaman@umich.edu>
+ *  Andy Adamson <andros@citi.umich.edu>
+ *
+ * permission is granted to use, copy, create derivative works and
+ * redistribute this software and such derivative works for any purpose,
+ * so long as the name of the university of michigan is not used in
+ * any advertising or publicity pertaining to the use or distribution
+ * of this software without specific, written prior authorization.  if
+ * the above copyright notice or any other identification of the
+ * university of michigan is included in any copy of any portion of
+ * this software, then the disclaimer below must also be included.
+ *
+ * this software is provided as is, without representation from the
+ * university of michigan as to its fitness for any purpose, and without
+ * warranty by the university of michigan of any kind, either express
+ * or implied, including without limitation the implied warranties of
+ * merchantability and fitness for a particular purpose.  the regents
+ * of the university of michigan shall not be liable for any damages,
+ * including special, indirect, incidental, or consequential damages,
+ * with respect to any claim arising out or in connection with the use
+ * of the software, even if it has been or is hereafter advised of the
+ * possibility of such damages.
+ */
+
+#include <linux/genhd.h> /* gendisk - used in a dprintk*/
+#include <linux/sched.h>
+#include <linux/hash.h>
+
+#include "blocklayout.h"
+
+#define NFSDBG_FACILITY         NFSDBG_PNFS_LD
+
+/* Defines used for calculating memory usage in nfs4_blk_flatten() */
+#define ARGSIZE   24    /* Max bytes needed for linear target arg string */
+#define SPECSIZE (sizeof8(struct dm_target_spec) + ARGSIZE)
+#define SPECS_PER_PAGE (PAGE_SIZE / SPECSIZE)
+#define SPEC_HEADER_ADJUST (SPECS_PER_PAGE - \
+			    (PAGE_SIZE - sizeof8(struct dm_ioctl)) / SPECSIZE)
+#define roundup8(x) (((x)+7) & ~7)
+#define sizeof8(x) roundup8(sizeof(x))
+
+static int dev_remove(dev_t dev)
+{
+	int ret = 1;
+	struct pipefs_hdr *msg = NULL, *reply = NULL;
+	uint64_t bl_dev;
+	uint32_t major = MAJOR(dev), minor = MINOR(dev);
+
+	dprintk("Entering %s\n", __func__);
+
+	if (IS_ERR(bl_device_pipe))
+		return ret;
+
+	memcpy((void *)&bl_dev, &major, sizeof(uint32_t));
+	memcpy((void *)&bl_dev + sizeof(uint32_t), &minor, sizeof(uint32_t));
+	msg = pipefs_alloc_init_msg(0, BL_DEVICE_UMOUNT, 0, (void *)&bl_dev,
+				    sizeof(uint64_t));
+	if (IS_ERR(msg)) {
+		dprintk("ERROR: couldn't make pipefs message.\n");
+		goto out;
+	}
+	msg->msgid = hash_ptr(&msg, sizeof(msg->msgid) * 8);
+	msg->status = BL_DEVICE_REQUEST_INIT;
+
+	reply = pipefs_queue_upcall_waitreply(bl_device_pipe, msg,
+					      &bl_device_list, 0, 0);
+	if (IS_ERR(reply)) {
+		dprintk("ERROR: upcall_waitreply failed\n");
+		goto out;
+	}
+
+	if (reply->status == BL_DEVICE_REQUEST_PROC)
+		ret = 0; /*TODO: what to return*/
+out:
+	if (!IS_ERR(reply))
+		kfree(reply);
+	if (!IS_ERR(msg))
+		kfree(msg);
+	return ret;
+}
+
+/*
+ * Release meta device
+ */
+static int nfs4_blk_metadev_release(struct pnfs_block_dev *bdev)
+{
+	int rv;
+	dev_t dev = bdev->bm_mdev->bd_dev; /* read before dropping the ref */
+
+	dprintk("%s Releasing\n", __func__);
+	/* XXX Check return? */
+	rv = nfs4_blkdev_put(bdev->bm_mdev);
+	dprintk("%s nfs4_blkdev_put returns %d\n", __func__, rv);
+	rv = dev_remove(dev);
+	dprintk("%s Returns %d\n", __func__, rv);
+	return rv;
+}
+
+void free_block_dev(struct pnfs_block_dev *bdev)
+{
+	if (bdev) {
+		if (bdev->bm_mdev) {
+			dprintk("%s Removing DM device: %d:%d\n",
+				__func__,
+				MAJOR(bdev->bm_mdev->bd_dev),
+				MINOR(bdev->bm_mdev->bd_dev));
+			/* XXX Check status ?? */
+			nfs4_blk_metadev_release(bdev);
+		}
+		kfree(bdev);
+	}
+}
-- 
1.7.4.1



* [PATCH 15/34] pnfsblock: lseg alloc and free
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (13 preceding siblings ...)
  2011-06-12 23:44 ` [PATCH 14/34] pnfsblock: remove " Jim Rees
@ 2011-06-12 23:44 ` Jim Rees
  2011-06-12 23:44 ` [PATCH 16/34] pnfsblock: merge extents Jim Rees
                   ` (18 subsequent siblings)
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: fix bug getting pnfs_layout_type in translate_devid().]
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Zhang Jingwang <Jingwang.Zhang@emc.com>
---
 fs/nfs/blocklayout/blocklayout.c    |   30 +++++++++++++++++++++++++++++-
 fs/nfs/blocklayout/blocklayout.h    |    6 ++++++
 fs/nfs/blocklayout/blocklayoutdev.c |    8 ++++++++
 3 files changed, 43 insertions(+), 1 deletions(-)
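
Reviewer note: because all extent data is stored layout-wide, every lseg resolves to the same pnfs_block_layout through BLK_LSEG2EXT(), which is container_of() applied to the embedded generic header. A userspace sketch of that lookup follows. All demo_* types and functions are hypothetical stand-ins for the kernel structures.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the generic pNFS structures. */
struct demo_layout_hdr {
	int refcount;
};

struct demo_layout_segment {
	struct demo_layout_hdr *pls_layout;	/* back pointer to the header */
};

/* Driver-private layout that embeds the generic header. */
struct demo_block_layout {
	int n_extents;				/* layout-wide driver state */
	struct demo_layout_hdr bl_layout;
};

#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Like BLK_LO2EXT(): recover the driver layout from the generic header. */
static struct demo_block_layout *demo_lo2ext(struct demo_layout_hdr *lo)
{
	return demo_container_of(lo, struct demo_block_layout, bl_layout);
}

/* Like BLK_LSEG2EXT(): every segment maps to the one shared layout. */
static struct demo_block_layout *demo_lseg2ext(struct demo_layout_segment *lseg)
{
	return demo_lo2ext(lseg->pls_layout);
}
```

Two segments of the same layout therefore return the same pointer, which is why the lseg itself can stay a bare allocation in this patch.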

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index a245e73..88b9d1a 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -115,13 +115,41 @@ bl_alloc_layout_hdr(struct inode *inode, gfp_t gfp_flags)
 static void
 bl_free_lseg(struct pnfs_layout_segment *lseg)
 {
+	dprintk("%s enter\n", __func__);
+	kfree(lseg);
 }
 
+/* Because the generic infrastructure does not correctly merge layouts,
+ * we pretty much ignore lseg, and store all data layout wide, so we
+ * can correctly merge.  Eventually we should push some correct merge
+ * behavior up to the generic code, as the current behavior tends to
+ * cause lots of unnecessary overlapping LAYOUTGET requests.
+ */
 static struct pnfs_layout_segment *
 bl_alloc_lseg(struct pnfs_layout_hdr *lo,
 	      struct nfs4_layoutget_res *lgr, gfp_t gfp_flags)
 {
-	return NULL;
+	struct pnfs_layout_segment *lseg;
+	int status;
+
+	dprintk("%s enter\n", __func__);
+	lseg = kzalloc(sizeof(*lseg) + 0, gfp_flags);
+	if (!lseg)
+		return NULL;
+	status = nfs4_blk_process_layoutget(lo, lgr, gfp_flags);
+	if (status) {
+		/* We don't want to call the full-blown bl_free_lseg,
+		 * since on error extents were not touched.
+		 */
+		/* STUB - we really want to distinguish between 2 error
+		 * conditions here.  This lseg failed, but lo data structures
+		 * are OK, or we hosed the lo data structures.  The calling
+		 * code probably needs to distinguish this too.
+		 */
+		kfree(lseg);
+		return ERR_PTR(status);
+	}
+	return lseg;
 }
 
 static void
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index e5ee11c..3cd447f 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -94,6 +94,12 @@ BLK_LO2EXT(struct pnfs_layout_hdr *lo)
         return container_of(lo, struct pnfs_block_layout, bl_layout);
 }
 
+static inline struct pnfs_block_layout *
+BLK_LSEG2EXT(struct pnfs_layout_segment *lseg)
+{
+	return BLK_LO2EXT(lseg->pls_layout);
+}
+
 /* blocklayoutdev.c */
 struct block_device *nfs4_blkdev_get(dev_t dev);
 int nfs4_blkdev_put(struct block_device *bdev);
diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
index 9a65a66..0fedf50 100644
--- a/fs/nfs/blocklayout/blocklayoutdev.c
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -149,3 +149,11 @@ out_err:
 		kfree(msg);
 	return NULL;
 }
+
+int
+nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
+			   struct nfs4_layoutget_res *lgr, gfp_t gfp_flags)
+{
+	/* STUB */
+	return -EIO;
+}
-- 
1.7.4.1



* [PATCH 16/34] pnfsblock: merge extents
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (14 preceding siblings ...)
  2011-06-12 23:44 ` [PATCH 15/34] pnfsblock: lseg alloc and free Jim Rees
@ 2011-06-12 23:44 ` Jim Rees
  2011-06-12 23:44 ` [PATCH 17/34] pnfsblock: call and parse getdevicelist Jim Rees
                   ` (17 subsequent siblings)
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Replace a stub, so that extents underlying the layouts are properly
added, merged, or ignored as necessary.

Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: delete the new node before put it]
Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Peng Tao <peng_tao@emc.com>
---
 fs/nfs/blocklayout/blocklayout.h |   14 +++++-
 fs/nfs/blocklayout/extents.c     |  106 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 119 insertions(+), 1 deletions(-)
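
Reviewer note: the merge rule in this patch is that two extents may be combined only if they have the same state and, unless they are holes, map file offsets to volume offsets with the same displacement on the same device. A userspace sketch of that rule for a single pair of extents follows. The demo_* names are hypothetical, an int stands in for the block device pointer, and the list handling of add_and_merge_extent() is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

enum demo_state {
	DEMO_READWRITE_DATA,
	DEMO_READ_DATA,
	DEMO_INVALID_DATA,
	DEMO_NONE_DATA,
};

struct demo_ext {
	uint64_t f_offset;	/* file offset, in sectors */
	uint64_t length;	/* length, in sectors */
	uint64_t v_offset;	/* volume offset, in sectors */
	enum demo_state state;
	int mdev;		/* stand-in for the block device pointer */
};

/* Like extents_consistent(); assumes add->f_offset >= cur->f_offset. */
static bool demo_consistent(const struct demo_ext *cur,
			    const struct demo_ext *add)
{
	if (add->state != cur->state)
		return false;
	if (add->state == DEMO_NONE_DATA)
		return true;	/* holes have no volume mapping to compare */
	return add->mdev == cur->mdev &&
	       add->v_offset - cur->v_offset == add->f_offset - cur->f_offset;
}

/*
 * Merge add into cur when add starts inside or directly after cur and
 * the two are consistent.  Returns 0 and fills *out with the union,
 * or -1 if no merge was done.
 */
static int demo_merge(const struct demo_ext *cur, const struct demo_ext *add,
		      struct demo_ext *out)
{
	uint64_t cur_end = cur->f_offset + cur->length;
	uint64_t add_end = add->f_offset + add->length;

	if (add->f_offset < cur->f_offset || add->f_offset > cur_end)
		return -1;
	if (!demo_consistent(cur, add))
		return -1;
	*out = *cur;
	if (add_end > cur_end)
		out->length = add_end - cur->f_offset;
	return 0;
}
```

Note one simplification: the patch treats an inconsistent overlap as an error but lets inconsistent extents that merely abut coexist, while this sketch reports both cases as "not merged".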

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 3cd447f..6bbfc3d 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -77,6 +77,14 @@ enum extentclass4 {
         EXTENT_LISTS    = 2,
 };
 
+static inline int choose_list(enum exstate4 state)
+{
+	if (state == PNFS_BLOCK_READ_DATA || state == PNFS_BLOCK_NONE_DATA)
+		return RO_EXTENT;
+	else
+		return RW_EXTENT;
+}
+
 struct pnfs_block_layout {
 	struct pnfs_layout_hdr bl_layout;
 	struct pnfs_inval_markings bl_inval; /* tracks INVAL->RW transition */
@@ -110,6 +118,11 @@ int nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
 				struct nfs4_layoutget_res *lgr, gfp_t gfp_flags);
 /* blocklayoutdm.c */
 void free_block_dev(struct pnfs_block_dev *bdev);
+/* extents.c */
+void put_extent(struct pnfs_block_extent *be);
+struct pnfs_block_extent *alloc_extent(void);
+int add_and_merge_extent(struct pnfs_block_layout *bl,
+			 struct pnfs_block_extent *new);
 
 #include <linux/sunrpc/simple_rpc_pipefs.h>
 
@@ -125,5 +138,4 @@ void bl_pipe_exit(void);
 #define BL_DEVICE_REQUEST_PROC         0x1 /* User level process succeeds */
 #define BL_DEVICE_REQUEST_ERR          0x2 /* User level process fails */
 
-void put_extent(struct pnfs_block_extent *be);
 #endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 1283fa9..26c263f 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -95,3 +95,109 @@ void print_elist(struct list_head *list)
 	}
 	dprintk("****************\n");
 }
+
+static inline int
+extents_consistent(struct pnfs_block_extent *old, struct pnfs_block_extent *new)
+{
+	/* Note this assumes new->be_f_offset >= old->be_f_offset */
+	return (new->be_state == old->be_state) &&
+		((new->be_state == PNFS_BLOCK_NONE_DATA) ||
+		 ((new->be_v_offset - old->be_v_offset ==
+		   new->be_f_offset - old->be_f_offset) &&
+		  new->be_mdev == old->be_mdev));
+}
+
+/* Adds new to appropriate list in bl, modifying new and removing existing
+ * extents as appropriate to deal with overlaps.
+ *
+ * See find_get_extent for list constraints.
+ *
+ * Refcount on new is already set.  If we end up not using it, or error out,
+ * we need to put the reference.
+ *
+ * Lock is held by caller.
+ */
+int
+add_and_merge_extent(struct pnfs_block_layout *bl,
+		     struct pnfs_block_extent *new)
+{
+	struct pnfs_block_extent *be, *tmp;
+	sector_t end = new->be_f_offset + new->be_length;
+	struct list_head *list;
+
+	dprintk("%s enter with be=%p\n", __func__, new);
+	print_bl_extent(new);
+	list = &bl->bl_extents[choose_list(new->be_state)];
+	print_elist(list);
+
+	/* Scan for proper place to insert, extending new to the left
+	 * as much as possible.
+	 */
+	list_for_each_entry_safe(be, tmp, list, be_node) {
+		if (new->be_f_offset < be->be_f_offset)
+			break;
+		if (end <= be->be_f_offset + be->be_length) {
+			/* new is a subset of existing be*/
+			if (extents_consistent(be, new)) {
+				dprintk("%s: new is subset, ignoring\n",
+					__func__);
+				put_extent(new);
+				return 0;
+			} else
+				goto out_err;
+		} else if (new->be_f_offset <=
+				be->be_f_offset + be->be_length) {
+			/* new overlaps or abuts existing be */
+			if (extents_consistent(be, new)) {
+				/* extend new to fully replace be */
+				new->be_length += new->be_f_offset -
+						  be->be_f_offset;
+				new->be_f_offset = be->be_f_offset;
+				new->be_v_offset = be->be_v_offset;
+				dprintk("%s: removing %p\n", __func__, be);
+				list_del(&be->be_node);
+				put_extent(be);
+			} else if (new->be_f_offset !=
+				   be->be_f_offset + be->be_length)
+				goto out_err;
+		}
+	}
+	/* Note that if we never hit the above break, be will not point to a
+	 * valid extent.  However, in that case &be->be_node==list.
+	 */
+	list_add_tail(&new->be_node, &be->be_node);
+	dprintk("%s: inserting new\n", __func__);
+	print_elist(list);
+	/* Scan forward for overlaps.  If we find any, extend new and
+	 * remove the overlapped extent.
+	 */
+	be = list_prepare_entry(new, list, be_node);
+	list_for_each_entry_safe_continue(be, tmp, list, be_node) {
+		if (end < be->be_f_offset)
+			break;
+		/* new overlaps or abuts existing be */
+		if (extents_consistent(be, new)) {
+			if (end < be->be_f_offset + be->be_length) {
+				/* extend new to fully cover be */
+				end = be->be_f_offset + be->be_length;
+				new->be_length = end - new->be_f_offset;
+			}
+			dprintk("%s: removing %p\n", __func__, be);
+			list_del(&be->be_node);
+			put_extent(be);
+		} else if (end != be->be_f_offset) {
+			list_del(&new->be_node);
+			goto out_err;
+		}
+	}
+	dprintk("%s: after merging\n", __func__);
+	print_elist(list);
+	/* STUB - The per-list consistency checks have all been done,
+	 * should now check cross-list consistency.
+	 */
+	return 0;
+
+ out_err:
+	put_extent(new);
+	return -EIO;
+}
-- 
1.7.4.1



* [PATCH 17/34] pnfsblock: call and parse getdevicelist
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (15 preceding siblings ...)
  2011-06-12 23:44 ` [PATCH 16/34] pnfsblock: merge extents Jim Rees
@ 2011-06-12 23:44 ` Jim Rees
  2011-06-14 15:36   ` Benny Halevy
  2011-06-12 23:44 ` [PATCH 18/34] pnfsblock: allow use of PG_owner_priv_1 flag Jim Rees
                   ` (16 subsequent siblings)
  33 siblings, 1 reply; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Call GETDEVICELIST during mount, then call and parse GETDEVICEINFO
for each device returned.

[pnfsblock: fix pnfs_deviceid references]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: fix print format warnings for sector_t and size_t]
[pnfs-block: #include <linux/vmalloc.h>]
[pnfsblock: no PNFS_NFS_SERVER]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[pnfsblock: fix bug determining size of striped volume]
[pnfsblock: fix oops when using multiple devices]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
 fs/nfs/blocklayout/blocklayout.c |  155 +++++++++++++++++++++++++++++++++++++-
 fs/nfs/blocklayout/blocklayout.h |   95 +++++++++++++++++++++++
 2 files changed, 248 insertions(+), 2 deletions(-)
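
Reviewer note: bl_set_layoutdriver() keeps issuing GETDEVICELIST until the reply sets eof, handling each batch of device IDs as it arrives. The control flow can be sketched in userspace as below. Everything here is hypothetical: struct demo_server fakes a server that hands out integer IDs in batches of DEMO_BATCH, and collecting IDs into an array stands in for the per-device GETDEVICEINFO call.

```c
#include <stddef.h>

#define DEMO_BATCH 4	/* hypothetical per-reply limit */

struct demo_devlist {
	int eof;			/* set once the server has no more IDs */
	unsigned int num_devs;		/* valid entries in dev_id[] */
	int dev_id[DEMO_BATCH];
};

/* Hypothetical server that hands out IDs 0..total-1 in batches. */
struct demo_server {
	int next;
	int total;
};

static int demo_getdevicelist(struct demo_server *srv, struct demo_devlist *dl)
{
	dl->num_devs = 0;
	while (dl->num_devs < DEMO_BATCH && srv->next < srv->total)
		dl->dev_id[dl->num_devs++] = srv->next++;
	dl->eof = (srv->next >= srv->total);
	return 0;
}

/* Collect every device ID, looping until eof.  Returns the count or -1. */
static int demo_collect(struct demo_server *srv, int *out, size_t cap)
{
	struct demo_devlist dl = { .eof = 0 };
	size_t n = 0;
	unsigned int i;

	while (!dl.eof) {
		if (demo_getdevicelist(srv, &dl))
			return -1;
		for (i = 0; i < dl.num_devs; i++) {
			if (n == cap)
				return -1;	/* caller's buffer is full */
			out[n++] = dl.dev_id[i];
		}
	}
	return (int)n;
}
```

As in the patch, the loop always makes at least one request because eof starts at 0, and any failure aborts the whole collection.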

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 88b9d1a..36374f4 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -31,7 +31,7 @@
  */
 #include <linux/module.h>
 #include <linux/init.h>
-
+#include <linux/vmalloc.h>
 #include "blocklayout.h"
 
 #define NFSDBG_FACILITY         NFSDBG_PNFS_LD
@@ -164,17 +164,168 @@ bl_cleanup_layoutcommit(struct pnfs_layout_hdr *lo,
 {
 }
 
+static void free_blk_mountid(struct block_mount_id *mid)
+{
+	if (mid) {
+		struct pnfs_block_dev *dev;
+		spin_lock(&mid->bm_lock);
+		while (!list_empty(&mid->bm_devlist)) {
+			dev = list_first_entry(&mid->bm_devlist,
+					       struct pnfs_block_dev,
+					       bm_node);
+			list_del(&dev->bm_node);
+			free_block_dev(dev);
+		}
+		spin_unlock(&mid->bm_lock);
+		kfree(mid);
+	}
+}
+
+/* This is mostly copied from the filelayout's get_device_info function.
+ * It seems much of this should be at the generic pnfs level.
+ */
+static struct pnfs_block_dev *
+nfs4_blk_get_deviceinfo(struct nfs_server *server, const struct nfs_fh *fh,
+			struct nfs4_deviceid *d_id,
+			struct list_head *sdlist)
+{
+	struct pnfs_device *dev;
+	struct pnfs_block_dev *rv = NULL;
+	u32 max_resp_sz;
+	int max_pages;
+	struct page **pages = NULL;
+	int i, rc;
+
+	/*
+	 * Use the session max response size as the basis for setting
+	 * GETDEVICEINFO's maxcount
+	 */
+	max_resp_sz = server->nfs_client->cl_session->fc_attrs.max_resp_sz;
+	max_pages = max_resp_sz >> PAGE_SHIFT;
+	dprintk("%s max_resp_sz %u max_pages %d\n",
+		__func__, max_resp_sz, max_pages);
+
+	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+	if (!dev) {
+		dprintk("%s kzalloc failed\n", __func__);
+		return NULL;
+	}
+
+	pages = kzalloc(max_pages * sizeof(struct page *), GFP_KERNEL);
+	if (pages == NULL) {
+		kfree(dev);
+		return NULL;
+	}
+	for (i = 0; i < max_pages; i++) {
+		pages[i] = alloc_page(GFP_KERNEL);
+		if (!pages[i])
+			goto out_free;
+	}
+
+	/* set dev->area */
+	dev->area = vmap(pages, max_pages, VM_MAP, PAGE_KERNEL);
+	if (!dev->area)
+		goto out_free;
+
+	memcpy(&dev->dev_id, d_id, sizeof(*d_id));
+	dev->layout_type = LAYOUT_BLOCK_VOLUME;
+	dev->pages = pages;
+	dev->pgbase = 0;
+	dev->pglen = PAGE_SIZE * max_pages;
+	dev->mincount = 0;
+
+	dprintk("%s: dev_id: %s\n", __func__, dev->dev_id.data);
+	rc = nfs4_proc_getdeviceinfo(server, dev);
+	dprintk("%s getdevice info returns %d\n", __func__, rc);
+	if (rc)
+		goto out_free;
+
+	rv = nfs4_blk_decode_device(server, dev, sdlist);
+ out_free:
+	if (dev->area != NULL)
+		vunmap(dev->area);
+	for (i = 0; i < max_pages && pages[i]; i++)
+		__free_page(pages[i]);
+	kfree(pages);
+	kfree(dev);
+	return rv;
+}
+
 static int
 bl_set_layoutdriver(struct nfs_server *server, const struct nfs_fh *fh)
 {
+	struct block_mount_id *b_mt_id = NULL;
+	struct pnfs_mount_type *mtype = NULL;
+	struct pnfs_devicelist *dlist = NULL;
+	struct pnfs_block_dev *bdev;
+	LIST_HEAD(block_disklist);
+	int status = 0, i;
+
 	dprintk("%s enter\n", __func__);
-	return 0;
+
+	if (server->pnfs_blksize == 0) {
+		dprintk("%s Server did not return blksize\n", __func__);
+		return -EINVAL;
+	}
+	status = -ENOMEM;
+	b_mt_id = kzalloc(sizeof(struct block_mount_id), GFP_KERNEL);
+	if (!b_mt_id)
+		goto out_error;
+	/* status stays -ENOMEM until GETDEVICELIST below overwrites it */
+	/* Initialize nfs4 block layout mount id */
+	spin_lock_init(&b_mt_id->bm_lock);
+	INIT_LIST_HEAD(&b_mt_id->bm_devlist);
+
+	dlist = kmalloc(sizeof(struct pnfs_devicelist), GFP_KERNEL);
+	if (!dlist)
+		goto out_error;
+	dlist->eof = 0;
+	while (!dlist->eof) {
+		status = nfs4_proc_getdevicelist(server, fh, dlist);
+		if (status)
+			goto out_error;
+		dprintk("%s GETDEVICELIST numdevs=%i, eof=%i\n",
+			__func__, dlist->num_devs, dlist->eof);
+		/* For each device returned in dlist, call GETDEVICEINFO, and
+		 * decode the opaque topology encoding to create a flat
+		 * volume topology, matching VOLUME_SIMPLE disk signatures
+		 * to disks in the visible block disk list.
+		 * Construct an LVM meta device from the flat volume topology.
+		 */
+		for (i = 0; i < dlist->num_devs; i++) {
+			bdev = nfs4_blk_get_deviceinfo(server, fh,
+						     &dlist->dev_id[i],
+						     &block_disklist);
+			if (!bdev) {
+				status = -ENODEV;
+				goto out_error;
+			}
+			spin_lock(&b_mt_id->bm_lock);
+			list_add(&bdev->bm_node, &b_mt_id->bm_devlist);
+			spin_unlock(&b_mt_id->bm_lock);
+		}
+	}
+	dprintk("%s SUCCESS\n", __func__);
+	server->pnfs_ld_data = b_mt_id;
+
+ out_return:
+	kfree(dlist);
+	return status;
+
+ out_error:
+	free_blk_mountid(b_mt_id);
+	kfree(mtype);
+	goto out_return;
 }
 
 static int
 bl_clear_layoutdriver(struct nfs_server *server)
 {
+	struct block_mount_id *b_mt_id = server->pnfs_ld_data;
+
 	dprintk("%s enter\n", __func__);
+	free_blk_mountid(b_mt_id);
+	dprintk("%s RETURNS\n", __func__);
 	return 0;
 }
 
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 6bbfc3d..21fa21c 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -35,12 +35,60 @@
 #include <linux/nfs_fs.h>
 #include "../pnfs.h"
 
+struct block_mount_id {
+	spinlock_t			bm_lock;    /* protects list */
+	struct list_head		bm_devlist; /* holds pnfs_block_dev */
+};
+
 struct pnfs_block_dev {
 	struct list_head		bm_node;
 	struct nfs4_deviceid		bm_mdevid;    /* associated devid */
 	struct block_device		*bm_mdev;     /* meta device itself */
 };
 
+/* holds visible disks that can be matched against VOLUME_SIMPLE signatures */
+struct visible_block_device {
+	struct list_head	vi_node;
+	struct block_device	*vi_bdev;
+	int			vi_mapped;
+	int			vi_put_done;
+};
+
+enum blk_vol_type {
+	PNFS_BLOCK_VOLUME_SIMPLE   = 0,	/* maps to a single LU */
+	PNFS_BLOCK_VOLUME_SLICE    = 1,	/* slice of another volume */
+	PNFS_BLOCK_VOLUME_CONCAT   = 2,	/* concatenation of multiple volumes */
+	PNFS_BLOCK_VOLUME_STRIPE   = 3	/* striped across multiple volumes */
+};
+
+/* All disk offset/lengths are stored in 512-byte sectors */
+struct pnfs_blk_volume {
+	uint32_t		bv_type;
+	sector_t 		bv_size;
+	struct pnfs_blk_volume 	**bv_vols;
+	int 			bv_vol_n;
+	union {
+		dev_t			bv_dev;
+		sector_t		bv_stripe_unit;
+		sector_t 		bv_offset;
+	};
+};
+
+/* Since components need not be aligned, cannot use sector_t */
+struct pnfs_blk_sig_comp {
+	int64_t 	bs_offset;  /* In bytes */
+	uint32_t   	bs_length;  /* In bytes */
+	char 		*bs_string;
+};
+
+/* Maximum number of signature components in a simple volume */
+# define PNFS_BLOCK_MAX_SIG_COMP 16
+
+struct pnfs_blk_sig {
+	int 				si_num_comps;
+	struct pnfs_blk_sig_comp	si_comps[PNFS_BLOCK_MAX_SIG_COMP];
+};
+
 enum exstate4 {
 	PNFS_BLOCK_READWRITE_DATA	= 0,
 	PNFS_BLOCK_READ_DATA		= 1,
@@ -96,6 +144,8 @@ struct pnfs_block_layout {
 	sector_t		bl_blocksize;  /* Server blocksize in sectors */
 };
 
+#define BLK_ID(lo) ((struct block_mount_id *)(NFS_SERVER(lo->plh_inode)->pnfs_ld_data))
+
 static inline struct pnfs_block_layout *
 BLK_LO2EXT(struct pnfs_layout_hdr *lo)
 {
@@ -108,6 +158,51 @@ BLK_LSEG2EXT(struct pnfs_layout_segment *lseg)
         return BLK_LO2EXT(lseg->pls_layout);
 }
 
+uint32_t *blk_overflow(uint32_t *p, uint32_t *end, size_t nbytes);
+
+#define BLK_READBUF(p, e, nbytes)  do { \
+	p = blk_overflow(p, e, nbytes); \
+	if (!p) { \
+		printk(KERN_WARNING \
+			"%s: reply buffer overflowed in line %d.\n", \
+			__func__, __LINE__); \
+		goto out_err; \
+	} \
+} while (0)
+
+#define READ32(x)         (x) = ntohl(*p++)
+#define READ64(x)         do {                  \
+	(x) = (uint64_t)ntohl(*p++) << 32;           \
+	(x) |= ntohl(*p++);                     \
+} while (0)
+#define COPYMEM(x, nbytes) do {                 \
+	memcpy((x), p, nbytes);                 \
+	p += XDR_QUADLEN(nbytes);               \
+} while (0)
+#define READ_DEVID(x)	COPYMEM((x)->data, NFS4_DEVICEID4_SIZE)
+#define READ_SECTOR(x)     do { \
+	READ64(tmp); \
+	if (tmp & 0x1ff) { \
+		printk(KERN_WARNING \
+		       "%s Value not 512-byte aligned at line %d\n", \
+		       __func__, __LINE__);			     \
+		goto out_err; \
+	} \
+	(x) = tmp >> 9; \
+} while (0)
+
+#define WRITE32(n)               do { \
+	*p++ = htonl(n); \
+	} while (0)
+#define WRITE64(n)               do {                           \
+	*p++ = htonl((uint32_t)((n) >> 32));			\
+	*p++ = htonl((uint32_t)(n));				\
+} while (0)
+#define WRITEMEM(ptr, nbytes)     do {                          \
+	p = xdr_encode_opaque_fixed(p, ptr, nbytes);	\
+} while (0)
+#define WRITE_DEVID(x)  WRITEMEM((x)->data, NFS4_DEVICEID4_SIZE)
+
 /* blocklayoutdev.c */
 struct block_device *nfs4_blkdev_get(dev_t dev);
 int nfs4_blkdev_put(struct block_device *bdev);
-- 
1.7.4.1



* [PATCH 18/34] pnfsblock: allow use of PG_owner_priv_1 flag
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (16 preceding siblings ...)
  2011-06-12 23:44 ` [PATCH 17/34] pnfsblock: call and parse getdevicelist Jim Rees
@ 2011-06-12 23:44 ` Jim Rees
  2011-06-13 15:56   ` Fred Isaman
  2011-06-12 23:44 ` [PATCH 19/34] pnfsblock: xdr decode pnfs_block_layout4 Jim Rees
                   ` (15 subsequent siblings)
  33 siblings, 1 reply; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

There is currently no good way for pnfs to communicate problems.  For
example, the Linux read code first tries to do readahead through
nfs_readpages.  Failure there is ignored, and it will later call
nfs_readpage.  Failure there is also ignored, except that the lack of
PG_uptodate is communicated back via -EIO.

With pnfs, it would be useful to be able to communicate to
nfs_readpage that direct disk IO failed on readahead, and that it
should fail over to using the MDS.

Making the page flag PG_owner_priv_1 available as PG_pnfserr is one
way to do so. (An alternative would be to embed this in the layout,
but then pg_test can't easily access the info.)

This may be better as generic pnfs code, in which case it should be
put in pnfs.h, or even page-flags.h.
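
The idea above can be illustrated with a small userspace sketch.  This
is not the driver code: struct demo_page, the bit number, and all
demo_* names are hypothetical stand-ins for struct page, PG_pnfserr and
the nfs read paths, and clearing the flag once it has been consumed is
an assumption made for the example.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only the flags word matters. */
struct demo_page {
	unsigned long flags;
};

/* Bit number chosen arbitrarily for the sketch (assumption). */
#define DEMO_PG_PNFSERR 0

static bool demo_page_pnfs_err(const struct demo_page *pg)
{
	return (pg->flags >> DEMO_PG_PNFSERR) & 1UL;
}

static void demo_set_page_pnfs_err(struct demo_page *pg)
{
	pg->flags |= 1UL << DEMO_PG_PNFSERR;
}

static void demo_clear_page_pnfs_err(struct demo_page *pg)
{
	pg->flags &= ~(1UL << DEMO_PG_PNFSERR);
}

enum demo_io_path { DEMO_IO_DISK, DEMO_IO_MDS };

/* Readahead: remember a direct disk I/O failure instead of dropping it. */
static void demo_readahead(struct demo_page *pg, bool disk_io_ok)
{
	if (!disk_io_ok)
		demo_set_page_pnfs_err(pg);
}

/* Later single-page read: consult the flag to pick the I/O path. */
static enum demo_io_path demo_readpage_path(struct demo_page *pg)
{
	if (demo_page_pnfs_err(pg)) {
		demo_clear_page_pnfs_err(pg);
		return DEMO_IO_MDS;	/* fall back to the metadata server */
	}
	return DEMO_IO_DISK;
}
```

The point of keeping the state in the page itself is that the second
read attempt needs no extra lookup to learn that the first one failed.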

Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
 fs/nfs/blocklayout/blocklayout.h |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 21fa21c..293f009 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -35,6 +35,11 @@
 #include <linux/nfs_fs.h>
 #include "../pnfs.h"
 
+#define PG_pnfserr PG_owner_priv_1
+#define PagePnfsErr(page)	test_bit(PG_pnfserr, &(page)->flags)
+#define SetPagePnfsErr(page)	set_bit(PG_pnfserr, &(page)->flags)
+#define ClearPagePnfsErr(page)	clear_bit(PG_pnfserr, &(page)->flags)
+
 struct block_mount_id {
 	spinlock_t			bm_lock;    /* protects list */
 	struct list_head		bm_devlist; /* holds pnfs_block_dev */
-- 
1.7.4.1



* [PATCH 19/34] pnfsblock: xdr decode pnfs_block_layout4
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (17 preceding siblings ...)
  2011-06-12 23:44 ` [PATCH 18/34] pnfsblock: allow use of PG_owner_priv_1 flag Jim Rees
@ 2011-06-12 23:44 ` Jim Rees
  2011-06-12 23:44 ` [PATCH 20/34] pnfsblock: find_get_extent Jim Rees
                   ` (14 subsequent siblings)
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

XDR-decode the block layout payload sent in the LAYOUTGET result, storing
the decoded extents in an extent list.
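
Each extent on the wire is a device ID followed by three 64-bit byte
values and a 32-bit state, which is where the per-extent size of
28 + NFS4_DEVICEID4_SIZE bytes in the decoder comes from.  The byte
values are stored as 512-byte sectors.  A rough userspace sketch of
that conversion, mirroring the READ64 and READ_SECTOR macros (the
demo_* names are hypothetical and not part of the patch):

```c
#include <arpa/inet.h>	/* ntohl */
#include <stdint.h>

/* Decode one big-endian 64-bit XDR value; returns the advanced cursor. */
static const uint32_t *demo_read64(const uint32_t *p, uint64_t *out)
{
	*out = (uint64_t)ntohl(*p++) << 32;
	*out |= ntohl(*p++);
	return p;
}

/*
 * Convert a byte count to 512-byte sectors.
 * Returns 0 on success, -1 if the value is not 512-byte aligned.
 */
static int demo_bytes_to_sectors(uint64_t bytes, uint64_t *sectors)
{
	if (bytes & 0x1ff)
		return -1;
	*sectors = bytes >> 9;
	return 0;
}
```

Rejecting unaligned values up front means the rest of the extent code
can work purely in sectors.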

Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: fix bug getting pnfs_layout_type in translate_devid().]
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
 fs/nfs/blocklayout/blocklayoutdev.c |  191 ++++++++++++++++++++++++++++++++++-
 1 files changed, 189 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
index 0fedf50..a90eb6b 100644
--- a/fs/nfs/blocklayout/blocklayoutdev.c
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -150,10 +150,197 @@ out_err:
 	return NULL;
 }
 
+/* Map deviceid returned by the server to constructed block_device */
+static struct block_device *translate_devid(struct pnfs_layout_hdr *lo,
+					    struct nfs4_deviceid *id)
+{
+	struct block_device *rv = NULL;
+	struct block_mount_id *mid;
+	struct pnfs_block_dev *dev;
+
+	dprintk("%s enter, lo=%p, id=%p\n", __func__, lo, id);
+	mid = BLK_ID(lo);
+	spin_lock(&mid->bm_lock);
+	list_for_each_entry(dev, &mid->bm_devlist, bm_node) {
+		if (memcmp(id->data, dev->bm_mdevid.data,
+			   NFS4_DEVICEID4_SIZE) == 0) {
+			rv = dev->bm_mdev;
+			goto out;
+		}
+	}
+ out:
+	spin_unlock(&mid->bm_lock);
+	dprintk("%s returning %p\n", __func__, rv);
+	return rv;
+}
+
+/* Tracks info needed to ensure extents in layout obey constraints of spec */
+struct layout_verification {
+	u32 mode;	/* R or RW */
+	u64 start;	/* Expected start of next non-COW extent */
+	u64 inval;	/* Start of INVAL coverage */
+	u64 cowread;	/* End of COW read coverage */
+};
+
+/* Verify the extent meets the layout requirements of the pnfs-block draft,
+ * section 2.3.1.
+ */
+static int verify_extent(struct pnfs_block_extent *be,
+			 struct layout_verification *lv)
+{
+	if (lv->mode == IOMODE_READ) {
+		if (be->be_state == PNFS_BLOCK_READWRITE_DATA ||
+		    be->be_state == PNFS_BLOCK_INVALID_DATA)
+			return -EIO;
+		if (be->be_f_offset != lv->start)
+			return -EIO;
+		lv->start += be->be_length;
+		return 0;
+	}
+	/* lv->mode == IOMODE_RW */
+	if (be->be_state == PNFS_BLOCK_READWRITE_DATA) {
+		if (be->be_f_offset != lv->start)
+			return -EIO;
+		if (lv->cowread > lv->start)
+			return -EIO;
+		lv->start += be->be_length;
+		lv->inval = lv->start;
+		return 0;
+	} else if (be->be_state == PNFS_BLOCK_INVALID_DATA) {
+		if (be->be_f_offset != lv->start)
+			return -EIO;
+		lv->start += be->be_length;
+		return 0;
+	} else if (be->be_state == PNFS_BLOCK_READ_DATA) {
+		if (be->be_f_offset > lv->start)
+			return -EIO;
+		if (be->be_f_offset < lv->inval)
+			return -EIO;
+		if (be->be_f_offset < lv->cowread)
+			return -EIO;
+		/* It looks like you might want to min this with lv->start,
+		 * but you really don't.
+		 */
+		lv->inval = lv->inval + be->be_length;
+		lv->cowread = be->be_f_offset + be->be_length;
+		return 0;
+	} else
+		return -EIO;
+}
+
+/* XDR decode pnfs_block_layout4 structure */
 int
 nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
 			   struct nfs4_layoutget_res *lgr, gfp_t gfp_flags)
 {
-	/* STUB */
-	return -EIO;
+	struct pnfs_block_layout *bl = BLK_LO2EXT(lo);
+	int i, status = -EIO;
+	uint32_t count;
+	struct pnfs_block_extent *be = NULL, *save;
+	struct xdr_stream stream;
+	struct xdr_buf buf;
+	struct page *scratch;
+	__be32 *p;
+	uint64_t tmp; /* Used by READ_SECTOR */
+	struct layout_verification lv = {
+		.mode = lgr->range.iomode,
+		.start = lgr->range.offset >> 9,
+		.inval = lgr->range.offset >> 9,
+		.cowread = lgr->range.offset >> 9,
+	};
+	LIST_HEAD(extents);
+
+	dprintk("---> %s\n", __func__);
+
+	scratch = alloc_page(gfp_flags);
+	if (!scratch)
+		return -ENOMEM;
+
+	xdr_init_decode_pages(&stream, &buf, lgr->layoutp->pages, lgr->layoutp->len);
+	xdr_set_scratch_buffer(&stream, page_address(scratch), PAGE_SIZE);
+
+	p = xdr_inline_decode(&stream, 4);
+	if (unlikely(!p))
+		goto out_err;
+
+	READ32(count);
+
+	dprintk("%s enter, number of extents %i\n", __func__, count);
+	p = xdr_inline_decode(&stream, (28 + NFS4_DEVICEID4_SIZE) * count);
+	if (unlikely(!p))
+		goto out_err;
+
+	/* Decode individual extents, putting them in temporary
+	 * staging area until whole layout is decoded to make error
+	 * recovery easier.
+	 */
+	for (i = 0; i < count; i++) {
+		be = alloc_extent();
+		if (!be) {
+			status = -ENOMEM;
+			goto out_err;
+		}
+		READ_DEVID(&be->be_devid);
+		be->be_mdev = translate_devid(lo, &be->be_devid);
+		if (!be->be_mdev)
+			goto out_err;
+
+		/* The next three values are read in as bytes,
+		 * but stored as 512-byte sector lengths
+		 */
+		READ_SECTOR(be->be_f_offset);
+		READ_SECTOR(be->be_length);
+		READ_SECTOR(be->be_v_offset);
+		READ32(be->be_state);
+		if (be->be_state == PNFS_BLOCK_INVALID_DATA)
+			be->be_inval = &bl->bl_inval;
+		if (verify_extent(be, &lv)) {
+			dprintk("%s verify failed\n", __func__);
+			goto out_err;
+		}
+		list_add_tail(&be->be_node, &extents);
+	}
+	if (lgr->range.offset + lgr->range.length != lv.start << 9) {
+		dprintk("%s Final length mismatch\n", __func__);
+		be = NULL;
+		goto out_err;
+	}
+	if (lv.start < lv.cowread) {
+		dprintk("%s Final uncovered COW extent\n", __func__);
+		be = NULL;
+		goto out_err;
+	}
+	/* Extents decoded properly, now try to merge them in to
+	 * existing layout extents.
+	 */
+	spin_lock(&bl->bl_ext_lock);
+	list_for_each_entry_safe(be, save, &extents, be_node) {
+		list_del(&be->be_node);
+		status = add_and_merge_extent(bl, be);
+		if (status) {
+			spin_unlock(&bl->bl_ext_lock);
+			/* This is a fairly catastrophic error, as the
+			 * entire layout extent lists are now corrupted.
+			 * We should have some way to distinguish this.
+			 */
+			be = NULL;
+			goto out_err;
+		}
+	}
+	spin_unlock(&bl->bl_ext_lock);
+	status = 0;
+ out:
+	__free_page(scratch);
+	dprintk("%s returns %i\n", __func__, status);
+	return status;
+
+ out_err:
+	put_extent(be);
+	while (!list_empty(&extents)) {
+		be = list_first_entry(&extents, struct pnfs_block_extent,
+				      be_node);
+		list_del(&be->be_node);
+		put_extent(be);
+	}
+	goto out;
 }
-- 
1.7.4.1



* [PATCH 20/34] pnfsblock: find_get_extent
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (18 preceding siblings ...)
  2011-06-12 23:44 ` [PATCH 19/34] pnfsblock: xdr decode pnfs_block_layout4 Jim Rees
@ 2011-06-12 23:44 ` Jim Rees
  2011-06-12 23:44 ` [PATCH 21/34] pnfsblock: SPLITME: add extent manipulation functions Jim Rees
                   ` (13 subsequent siblings)
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Fred <iisaman@citi.umich.edu>

Implement find_get_extent(), one of the core extent manipulation
routines.
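
The lookup walks an offset-ordered list from the tail and stops as soon
as an extent ends at or before the requested sector.  A simplified
sketch of that scan over a plain array (hypothetical demo_* names; no
locking, refcounting, or COW handling, all of which the real function
has):

```c
#include <stddef.h>
#include <stdint.h>

struct demo_extent {
	uint64_t f_offset;	/* first file sector covered */
	uint64_t length;	/* length in sectors */
};

/*
 * Scan extents sorted by f_offset from the tail.
 * Returns the index of the extent containing isect, or -1 if none.
 */
static int demo_find_extent(const struct demo_extent *ext, size_t n,
			    uint64_t isect)
{
	size_t i = n;

	while (i-- > 0) {
		/* Extents are ordered, so earlier ones end before isect too */
		if (isect >= ext[i].f_offset + ext[i].length)
			break;
		if (isect >= ext[i].f_offset)
			return (int)i;
	}
	return -1;
}
```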

[pnfsblock: Lookup list entry of layouts and tags in reverse order]
Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>

pnfsblock: fix print format warnings for sector_t and size_t

gcc spews warnings about these on x86_64, e.g.:
fs/nfs/blocklayout/blocklayout.c:74: warning: format ‘%Lu’ expects type ‘long long unsigned int’, but argument 2 has type ‘sector_t’
fs/nfs/blocklayout/blocklayout.c:388: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘size_t’

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
 fs/nfs/blocklayout/blocklayout.h |    3 ++
 fs/nfs/blocklayout/extents.c     |   47 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 293f009..06aa36a 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -219,6 +219,9 @@ int nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
 /* blocklayoutdm.c */
 void free_block_dev(struct pnfs_block_dev *bdev);
 /* extents.c */
+struct pnfs_block_extent *
+find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
+		struct pnfs_block_extent **cow_read);
 void put_extent(struct pnfs_block_extent *be);
 struct pnfs_block_extent *alloc_extent(void);
 int add_and_merge_extent(struct pnfs_block_layout *bl,
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 26c263f..f0b3f13 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -201,3 +201,50 @@ add_and_merge_extent(struct pnfs_block_layout *bl,
 	put_extent(new);
 	return -EIO;
 }
+
+/* Returns extent, or NULL.  If a second READ extent exists, it is returned
+ * in cow_read, if given.
+ *
+ * The extents are kept in two separate ordered lists, one for READ and NONE,
+ * one for READWRITE and INVALID.  Within each list, we assume:
+ * 1. Extents are ordered by file offset.
+ * 2. For any given isect, there is at most one extent that matches.
+ */
+struct pnfs_block_extent *
+find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
+	    struct pnfs_block_extent **cow_read)
+{
+	struct pnfs_block_extent *be, *cow, *ret;
+	int i;
+
+	dprintk("%s enter with isect %llu\n", __func__, (u64)isect);
+	cow = ret = NULL;
+	spin_lock(&bl->bl_ext_lock);
+	for (i = 0; i < EXTENT_LISTS; i++) {
+		if (ret &&
+		    (!cow_read || ret->be_state != PNFS_BLOCK_INVALID_DATA))
+			break;
+		list_for_each_entry_reverse(be, &bl->bl_extents[i], be_node) {
+			if (isect >= be->be_f_offset + be->be_length)
+				break;
+			if (isect >= be->be_f_offset) {
+				/* We have found an extent */
+				dprintk("%s Get %p (%i)\n", __func__, be,
+					atomic_read(&be->be_refcnt.refcount));
+				kref_get(&be->be_refcnt);
+				if (!ret)
+					ret = be;
+				else if (be->be_state != PNFS_BLOCK_READ_DATA)
+					put_extent(be);
+				else
+					cow = be;
+				break;
+			}
+		}
+	}
+	spin_unlock(&bl->bl_ext_lock);
+	if (cow_read)
+		*cow_read = cow;
+	print_bl_extent(ret);
+	return ret;
+}
-- 
1.7.4.1



* [PATCH 21/34] pnfsblock: SPLITME: add extent manipulation functions
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (19 preceding siblings ...)
  2011-06-12 23:44 ` [PATCH 20/34] pnfsblock: find_get_extent Jim Rees
@ 2011-06-12 23:44 ` Jim Rees
  2011-06-14 15:40   ` Benny Halevy
  2011-06-12 23:44 ` [PATCH 22/34] pnfsblock: merge rw extents Jim Rees
                   ` (12 subsequent siblings)
  33 siblings, 1 reply; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Adds working implementations of various support functions
to handle INVAL extents, needed by writes, such as
mark_initialized_sectors and is_sector_initialized.

SPLIT: this needs to be split into the exported functions, and the
range support functions (which will be replaced eventually.)
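
The range support functions round sector ranges to a step boundary
before touching the tracking list.  A userspace sketch of that
rounding, assuming plain 64-bit arithmetic in place of do_div (the
demo_* names are hypothetical):

```c
#include <stdint.h>

/* Largest t <= s with t % base == 0. */
static uint64_t demo_normalize(uint64_t s, uint32_t base)
{
	return s - (s % base);
}

/* Smallest t >= s with t % base == 0. */
static uint64_t demo_normalize_up(uint64_t s, uint32_t base)
{
	return demo_normalize(s + base - 1, base);
}

/* Number of step-sized entries needed to cover [offset, offset + length). */
static uint64_t demo_entries_for_range(uint64_t offset, uint64_t length,
				       uint32_t step)
{
	uint64_t start = demo_normalize(offset, step);
	uint64_t end = demo_normalize_up(offset + length, step);

	return (end - start) / step;
}
```

For example, with a step of 8 sectors the range [5, 15) rounds out to
[0, 16) and needs two entries.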

[pnfsblock: fix 64-bit compiler warnings for extent manipulation]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
 fs/nfs/blocklayout/blocklayout.h |   30 ++++-
 fs/nfs/blocklayout/extents.c     |  253 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 281 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 06aa36a..a231d49 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -35,6 +35,8 @@
 #include <linux/nfs_fs.h>
 #include "../pnfs.h"
 
+#define PAGE_CACHE_SECTORS (PAGE_CACHE_SIZE >> 9)
+
 #define PG_pnfserr PG_owner_priv_1
 #define PagePnfsErr(page)	test_bit(PG_pnfserr, &(page)->flags)
 #define SetPagePnfsErr(page)	set_bit(PG_pnfserr, &(page)->flags)
@@ -101,8 +103,23 @@ enum exstate4 {
 	PNFS_BLOCK_NONE_DATA		= 3  /* unmapped, it's a hole */
 };
 
+#define MY_MAX_TAGS (15) /* tag bitnums used must be less than this */
+
+struct my_tree_t {
+	sector_t		mtt_step_size;	/* Internal sector alignment */
+	struct list_head	mtt_stub; /* Should be a radix tree */
+};
+
 struct pnfs_inval_markings {
-	/* STUB */
+	spinlock_t	im_lock;
+	struct my_tree_t im_tree;	/* Sectors that need LAYOUTCOMMIT */
+	sector_t	im_block_size;	/* Server blocksize in sectors */
+};
+
+struct pnfs_inval_tracking {
+	struct list_head it_link;
+	int		 it_sector;
+	int		 it_tags;
 };
 
 /* sector_t fields are all in 512-byte sectors */
@@ -121,7 +138,11 @@ struct pnfs_block_extent {
 static inline void
 INIT_INVAL_MARKS(struct pnfs_inval_markings *marks, sector_t blocksize)
 {
-	/* STUB */
+	spin_lock_init(&marks->im_lock);
+	INIT_LIST_HEAD(&marks->im_tree.mtt_stub);
+	marks->im_block_size = blocksize;
+	marks->im_tree.mtt_step_size = min((sector_t)PAGE_CACHE_SECTORS,
+					   blocksize);
 }
 
 enum extentclass4 {
@@ -222,8 +243,13 @@ void free_block_dev(struct pnfs_block_dev *bdev);
 struct pnfs_block_extent *
 find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
 		struct pnfs_block_extent **cow_read);
+int mark_initialized_sectors(struct pnfs_inval_markings *marks,
+			     sector_t offset, sector_t length,
+			     sector_t **pages);
 void put_extent(struct pnfs_block_extent *be);
 struct pnfs_block_extent *alloc_extent(void);
+struct pnfs_block_extent *get_extent(struct pnfs_block_extent *be);
+int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect);
 int add_and_merge_extent(struct pnfs_block_layout *bl,
 			 struct pnfs_block_extent *new);
 
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index f0b3f13..3d36f66 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -33,6 +33,259 @@
 #include "blocklayout.h"
 #define NFSDBG_FACILITY         NFSDBG_PNFS_LD
 
+/* Bit numbers */
+#define EXTENT_INITIALIZED 0
+#define EXTENT_WRITTEN     1
+#define EXTENT_IN_COMMIT   2
+#define INTERNAL_EXISTS    MY_MAX_TAGS
+#define INTERNAL_MASK      ((1 << INTERNAL_EXISTS) - 1)
+
+/* Returns largest t<=s s.t. t%base==0 */
+static inline sector_t normalize(sector_t s, int base)
+{
+	sector_t tmp = s; /* Since do_div modifies its argument */
+	return s - do_div(tmp, base);
+}
+
+static inline sector_t normalize_up(sector_t s, int base)
+{
+	return normalize(s + base - 1, base);
+}
+
+/* Complete stub using a list while we determine the API wanted */
+
+/* Returns tags, or negative */
+static int32_t _find_entry(struct my_tree_t *tree, u64 s)
+{
+	struct pnfs_inval_tracking *pos;
+
+	dprintk("%s(%llu) enter\n", __func__, s);
+	list_for_each_entry_reverse(pos, &tree->mtt_stub, it_link) {
+		if (pos->it_sector > s)
+			continue;
+		else if (pos->it_sector == s)
+			return pos->it_tags & INTERNAL_MASK;
+		else
+			break;
+	}
+	return -ENOENT;
+}
+
+static inline
+int _has_tag(struct my_tree_t *tree, u64 s, int32_t tag)
+{
+	int32_t tags;
+
+	dprintk("%s(%llu, %i) enter\n", __func__, s, tag);
+	s = normalize(s, tree->mtt_step_size);
+	tags = _find_entry(tree, s);
+	if ((tags < 0) || !(tags & (1 << tag)))
+		return 0;
+	else
+		return 1;
+}
+
+/* Creates entry with tag, or if entry already exists, unions tag to it.
+ * If storage is not NULL, newly created entry will use it.
+ * Returns number of entries added, or negative on error.
+ */
+static int _add_entry(struct my_tree_t *tree, u64 s, int32_t tag,
+		      struct pnfs_inval_tracking *storage)
+{
+	int found = 0;
+	struct pnfs_inval_tracking *pos;
+
+	dprintk("%s(%llu, %i, %p) enter\n", __func__, s, tag, storage);
+	list_for_each_entry_reverse(pos, &tree->mtt_stub, it_link) {
+		if (pos->it_sector > s)
+			continue;
+		else if (pos->it_sector == s) {
+			found = 1;
+			break;
+		} else
+			break;
+	}
+	if (found) {
+		pos->it_tags |= (1 << tag);
+		return 0;
+	} else {
+		struct pnfs_inval_tracking *new;
+		if (storage)
+			new = storage;
+		else {
+			new = kmalloc(sizeof(*new), GFP_KERNEL);
+			if (!new)
+				return -ENOMEM;
+		}
+		new->it_sector = s;
+		new->it_tags = (1 << tag);
+		list_add(&new->it_link, &pos->it_link);
+		return 1;
+	}
+}
+
+/* XXXX Really want option to not create */
+/* Over range, unions tag with existing entries, else creates entry with tag */
+static int _set_range(struct my_tree_t *tree, int32_t tag, u64 s, u64 length)
+{
+	u64 i;
+
+	dprintk("%s(%i, %llu, %llu) enter\n", __func__, tag, s, length);
+	for (i = normalize(s, tree->mtt_step_size); i < s + length;
+	     i += tree->mtt_step_size)
+		if (_add_entry(tree, i, tag, NULL))
+			return -ENOMEM;
+	return 0;
+}
+
+/* Ensure that future operations on given range of tree will not malloc */
+static int _preload_range(struct my_tree_t *tree, u64 offset, u64 length)
+{
+	u64 start, end, s;
+	int count, i, used = 0, status = -ENOMEM;
+	struct pnfs_inval_tracking **storage;
+
+	dprintk("%s(%llu, %llu) enter\n", __func__, offset, length);
+	start = normalize(offset, tree->mtt_step_size);
+	end = normalize_up(offset + length, tree->mtt_step_size);
+	count = (int)(end - start) / (int)tree->mtt_step_size;
+
+	/* Pre-malloc what memory we might need */
+	storage = kmalloc(sizeof(*storage) * count, GFP_KERNEL);
+	if (!storage)
+		return -ENOMEM;
+	for (i = 0; i < count; i++) {
+		storage[i] = kmalloc(sizeof(struct pnfs_inval_tracking),
+				     GFP_KERNEL);
+		if (!storage[i])
+			goto out_cleanup;
+	}
+
+	/* Now need lock - HOW??? */
+
+	for (s = start; s < end; s += tree->mtt_step_size)
+		used += _add_entry(tree, s, INTERNAL_EXISTS, storage[used]);
+
+	/* Unlock - HOW??? */
+	status = 0;
+
+ out_cleanup:
+	for (i = used; i < count; i++) {
+		if (!storage[i])
+			break;
+		kfree(storage[i]);
+	}
+	kfree(storage);
+	return status;
+}
+
+static void set_needs_init(sector_t *array, sector_t offset)
+{
+	sector_t *p = array;
+
+	dprintk("%s enter\n", __func__);
+	if (!p)
+		return;
+	while (*p < offset)
+		p++;
+	if (*p == offset)
+		return;
+	else if (*p == ~0) {
+		*p++ = offset;
+		*p = ~0;
+		return;
+	} else {
+		sector_t *save = p;
+		dprintk("%s Adding %llu\n", __func__, (u64)offset);
+		while (*p != ~0)
+			p++;
+		p++;
+		memmove(save + 1, save, (char *)p - (char *)save);
+		*save = offset;
+		return;
+	}
+}
+
+/* We are relying on page lock to serialize this */
+int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect)
+{
+	int rv;
+
+	spin_lock(&marks->im_lock);
+	rv = _has_tag(&marks->im_tree, isect, EXTENT_INITIALIZED);
+	spin_unlock(&marks->im_lock);
+	return rv;
+}
+
+/* Marks sectors in [offset, offset + length) as having been initialized.
+ * All lengths are step-aligned, where step is min(pagesize, blocksize).
+ * Notes where partial block is initialized, and helps prepare it for
+ * complete initialization later.
+ */
+/* Currently assumes offset is page-aligned */
+int mark_initialized_sectors(struct pnfs_inval_markings *marks,
+			     sector_t offset, sector_t length,
+			     sector_t **pages)
+{
+	sector_t s, start, end;
+	sector_t *array = NULL; /* Pages to mark */
+
+	dprintk("%s(offset=%llu,len=%llu) enter\n",
+		__func__, (u64)offset, (u64)length);
+	s = max((sector_t) 3,
+		2 * (marks->im_block_size / (PAGE_CACHE_SECTORS)));
+	dprintk("%s set max=%llu\n", __func__, (u64)s);
+	if (pages) {
+		array = kmalloc(s * sizeof(sector_t), GFP_KERNEL);
+		if (!array)
+			goto outerr;
+		array[0] = ~0;
+	}
+
+	start = normalize(offset, marks->im_block_size);
+	end = normalize_up(offset + length, marks->im_block_size);
+	if (_preload_range(&marks->im_tree, start, end - start))
+		goto outerr;
+
+	spin_lock(&marks->im_lock);
+
+	for (s = normalize_up(start, PAGE_CACHE_SECTORS);
+	     s < offset; s += PAGE_CACHE_SECTORS) {
+		dprintk("%s pre-area pages\n", __func__);
+		/* Portion of used block is not initialized */
+		if (!_has_tag(&marks->im_tree, s, EXTENT_INITIALIZED))
+			set_needs_init(array, s);
+	}
+	if (_set_range(&marks->im_tree, EXTENT_INITIALIZED, offset, length))
+		goto out_unlock;
+	for (s = normalize_up(offset + length, PAGE_CACHE_SECTORS);
+	     s < end; s += PAGE_CACHE_SECTORS) {
+		dprintk("%s post-area pages\n", __func__);
+		if (!_has_tag(&marks->im_tree, s, EXTENT_INITIALIZED))
+			set_needs_init(array, s);
+	}
+
+	spin_unlock(&marks->im_lock);
+
+	if (pages) {
+		if (array[0] == ~0) {
+			kfree(array);
+			*pages = NULL;
+		} else
+			*pages = array;
+	}
+	return 0;
+
+ out_unlock:
+	spin_unlock(&marks->im_lock);
+ outerr:
+	if (pages) {
+		kfree(array);
+		*pages = NULL;
+	}
+	return -ENOMEM;
+}
+
 static void print_bl_extent(struct pnfs_block_extent *be)
 {
 	dprintk("PRINT EXTENT extent %p\n", be);
-- 
1.7.4.1



* [PATCH 22/34] pnfsblock: merge rw extents
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (20 preceding siblings ...)
  2011-06-12 23:44 ` [PATCH 21/34] pnfsblock: SPLITME: add extent manipulation functions Jim Rees
@ 2011-06-12 23:44 ` Jim Rees
  2011-06-12 23:44 ` [PATCH 23/34] pnfsblock: encode_layoutcommit Jim Rees
                   ` (11 subsequent siblings)
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
 fs/nfs/blocklayout/extents.c |   47 ++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 3d36f66..43a3601 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -501,3 +501,50 @@ find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
 	print_bl_extent(ret);
 	return ret;
 }
+
+/* Helper function for set_to_rw that initializes a new extent */
+static void
+_prep_new_extent(struct pnfs_block_extent *new,
+		 struct pnfs_block_extent *orig,
+		 sector_t offset, sector_t length, int state)
+{
+	kref_init(&new->be_refcnt);
+	/* don't need to INIT_LIST_HEAD(&new->be_node) */
+	memcpy(&new->be_devid, &orig->be_devid, sizeof(struct nfs4_deviceid));
+	new->be_mdev = orig->be_mdev;
+	new->be_f_offset = offset;
+	new->be_length = length;
+	new->be_v_offset = orig->be_v_offset - orig->be_f_offset + offset;
+	new->be_state = state;
+	new->be_inval = orig->be_inval;
+}
+
+/* Tries to merge be with extent in front of it in list.
+ * Frees storage if not used.
+ */
+static struct pnfs_block_extent *
+_front_merge(struct pnfs_block_extent *be, struct list_head *head,
+	     struct pnfs_block_extent *storage)
+{
+	struct pnfs_block_extent *prev;
+
+	if (!storage)
+		goto no_merge;
+	if (&be->be_node == head || be->be_node.prev == head)
+		goto no_merge;
+	prev = list_entry(be->be_node.prev, struct pnfs_block_extent, be_node);
+	if ((prev->be_f_offset + prev->be_length != be->be_f_offset) ||
+	    !extents_consistent(prev, be))
+		goto no_merge;
+	_prep_new_extent(storage, prev, prev->be_f_offset,
+			 prev->be_length + be->be_length, prev->be_state);
+	list_replace(&prev->be_node, &storage->be_node);
+	put_extent(prev);
+	list_del(&be->be_node);
+	put_extent(be);
+	return storage;
+
+ no_merge:
+	kfree(storage);
+	return be;
+}
-- 
1.7.4.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 23/34] pnfsblock: encode_layoutcommit
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (21 preceding siblings ...)
  2011-06-12 23:44 ` [PATCH 22/34] pnfsblock: merge rw extents Jim Rees
@ 2011-06-12 23:44 ` Jim Rees
  2011-06-14 15:44   ` Benny Halevy
  2011-06-12 23:44 ` [PATCH 24/34] pnfsblock: cleanup_layoutcommit Jim Rees
                   ` (10 subsequent siblings)
  33 siblings, 1 reply; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

In the blocklayout driver, two things happen during
layoutcommit/cleanup:
1. The modified extents are encoded.
2. On cleanup, the extents are put back on the layout rw
   extents list, for reads.

In the new system, where the actual XDR encoding is done in
encode_layoutcommit() directly into the xdr buffer, these are
the new commit stages:

1. On setup_layoutcommit, the range is adjusted as before
   and a structure is allocated for communication with
   bl_encode_layoutcommit && bl_cleanup_layoutcommit.
   (The generic layer provides a void-star to hang it on.)

2. bl_encode_layoutcommit is called to do the actual
   encoding directly into xdr. The commit-extent-list is not
   freed and is stored on the above structure.
   FIXME: The code is not yet converted to the new XDR cleanup

3. On cleanup, the commit-extent-list is put back by a call
   to set_to_rw() as before, but with no need to XDR-decode
   the list first. The commit-extent-list is then freed, and
   finally the allocated structure is freed.

Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[blocklayout: encode_layoutcommit implementation]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[pnfsblock: fix bug setting up layoutcommit.]
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
[pnfsblock: prevent commit list corruption]
[pnfsblock: fix layoutcommit with an empty opaque]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
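Note for readers (not part of the commit): the wire layout produced by
encode_pnfs_block_layoutupdate in the diff below can be sketched in
userspace roughly as follows. This is a sketch under assumptions, not
the kernel code: it writes big-endian bytes by hand instead of using
the kernel XDR macros, assumes a 16-byte device id, and assumes
PNFS_BLOCK_READWRITE_DATA has the value 0. struct short_ext, put32(),
put64() and encode_update() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEVID_SIZE 16		/* assumed sizeof(struct nfs4_deviceid) */
#define EXT_WIRE_SIZE (DEVID_SIZE + 7 * 4)
#define STATE_RW 0		/* assumed PNFS_BLOCK_READWRITE_DATA */

struct short_ext {		/* hypothetical stand-in for the short extent */
	uint8_t devid[DEVID_SIZE];
	uint64_t f_offset;	/* in 512-byte sectors */
	uint64_t length;	/* in 512-byte sectors */
};

/* Store v big-endian at p and return the position after it. */
static uint8_t *put32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
	return p + 4;
}

static uint8_t *put64(uint8_t *p, uint64_t v)
{
	p = put32(p, (uint32_t)(v >> 32));
	return put32(p, (uint32_t)v);
}

/* Encode as many extents as fit.  Returns the bytes written, or 0 if
 * even the 8-byte header does not fit.  *encoded gets the count. */
static size_t encode_update(uint8_t *buf, size_t buflen,
			    const struct short_ext *ext, uint32_t n,
			    uint32_t *encoded)
{
	uint8_t *p;
	uint32_t count = 0;

	*encoded = 0;
	if (buflen < 8)
		return 0;
	p = buf + 8;		/* reserve opaque length + extent count */
	while (count < n && (size_t)(p - buf) + EXT_WIRE_SIZE <= buflen) {
		memcpy(p, ext[count].devid, DEVID_SIZE);
		p += DEVID_SIZE;
		p = put64(p, ext[count].f_offset << 9);	/* sectors to bytes */
		p = put64(p, ext[count].length << 9);
		p = put64(p, 0);			/* storage offset */
		p = put32(p, STATE_RW);
		count++;
	}
	put32(buf, (uint32_t)(p - buf - 4));	/* bytes after the length word */
	put32(buf + 4, count);
	*encoded = count;
	return (size_t)(p - buf);
}
```

As in the patch, the two header words are reserved first and filled in
only after the loop, once the number of encoded extents is known.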
 fs/nfs/blocklayout/blocklayout.c |    2 +
 fs/nfs/blocklayout/blocklayout.h |   12 +++
 fs/nfs/blocklayout/extents.c     |  175 ++++++++++++++++++++++++++++----------
 3 files changed, 145 insertions(+), 44 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 36374f4..1c9a5d0 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -156,6 +156,8 @@ static void
 bl_encode_layoutcommit(struct pnfs_layout_hdr *lo, struct xdr_stream *xdr,
 		       const struct nfs4_layoutcommit_args *arg)
 {
+	dprintk("%s enter\n", __func__);
+	encode_pnfs_block_layoutupdate(BLK_LO2EXT(lo), xdr, arg);
 }
 
 static void
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index a231d49..03d703b 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -135,6 +135,15 @@ struct pnfs_block_extent {
 	struct pnfs_inval_markings *be_inval; /* tracks INVAL->RW transition */
 };
 
+/* Shortened extent used by LAYOUTCOMMIT */
+struct pnfs_block_short_extent {
+	struct list_head bse_node;
+	struct nfs4_deviceid bse_devid;	/* STUB - removable??? */
+	struct block_device *bse_mdev;
+	sector_t	bse_f_offset;	/* the starting offset in the file */
+	sector_t	bse_length;	/* the size of the extent */
+};
+
 static inline void
 INIT_INVAL_MARKS(struct pnfs_inval_markings *marks, sector_t blocksize)
 {
@@ -250,6 +259,9 @@ void put_extent(struct pnfs_block_extent *be);
 struct pnfs_block_extent *alloc_extent(void);
 struct pnfs_block_extent *get_extent(struct pnfs_block_extent *be);
 int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect);
+int encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
+				   struct xdr_stream *xdr,
+				   const struct nfs4_layoutcommit_args *arg);
 int add_and_merge_extent(struct pnfs_block_layout *bl,
 			 struct pnfs_block_extent *new);
 
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 43a3601..e754d32 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -286,6 +286,47 @@ int mark_initialized_sectors(struct pnfs_inval_markings *marks,
 	return -ENOMEM;
 }
 
+/* Marks sectors in [offset, offset+length) as having been written to disk.
+ * All lengths should be block aligned.
+ */
+int mark_written_sectors(struct pnfs_inval_markings *marks,
+			 sector_t offset, sector_t length)
+{
+	int status;
+
+	dprintk("%s(offset=%llu,len=%llu) enter\n", __func__,
+		(u64)offset, (u64)length);
+	spin_lock(&marks->im_lock);
+	status = _set_range(&marks->im_tree, EXTENT_WRITTEN, offset, length);
+	spin_unlock(&marks->im_lock);
+	return status;
+}
+
+static void print_short_extent(struct pnfs_block_short_extent *be)
+{
+	dprintk("PRINT SHORT EXTENT extent %p\n", be);
+	if (be) {
+		dprintk("        be_f_offset %llu\n", (u64)be->bse_f_offset);
+		dprintk("        be_length   %llu\n", (u64)be->bse_length);
+	}
+}
+
+void print_clist(struct list_head *list, unsigned int count)
+{
+	struct pnfs_block_short_extent *be;
+	unsigned int i = 0;
+
+	dprintk("****************\n");
+	dprintk("Extent list looks like:\n");
+	list_for_each_entry(be, list, bse_node) {
+		i++;
+		print_short_extent(be);
+	}
+	if (i != count)
+		dprintk("\n\nExpected %u entries\n\n\n", count);
+	dprintk("****************\n");
+}
+
 static void print_bl_extent(struct pnfs_block_extent *be)
 {
 	dprintk("PRINT EXTENT extent %p\n", be);
@@ -386,65 +427,67 @@ add_and_merge_extent(struct pnfs_block_layout *bl,
 	/* Scan for proper place to insert, extending new to the left
 	 * as much as possible.
 	 */
-	list_for_each_entry_safe(be, tmp, list, be_node) {
-		if (new->be_f_offset < be->be_f_offset)
+	list_for_each_entry_safe_reverse(be, tmp, list, be_node) {
+		if (new->be_f_offset >= be->be_f_offset + be->be_length)
 			break;
-		if (end <= be->be_f_offset + be->be_length) {
-			/* new is a subset of existing be*/
+		if (new->be_f_offset >= be->be_f_offset) {
+			if (end <= be->be_f_offset + be->be_length) {
+				/* new is a subset of existing be*/
+				if (extents_consistent(be, new)) {
+					dprintk("%s: new is subset, ignoring\n",
+						__func__);
+					put_extent(new);
+					return 0;
+				} else {
+					goto out_err;
+				}
+			} else {
+				/* |<--   be   -->|
+				 *          |<--   new   -->| */
+				if (extents_consistent(be, new)) {
+					/* extend new to fully replace be */
+					new->be_length += new->be_f_offset -
+						be->be_f_offset;
+					new->be_f_offset = be->be_f_offset;
+					new->be_v_offset = be->be_v_offset;
+					dprintk("%s: removing %p\n", __func__, be);
+					list_del(&be->be_node);
+					put_extent(be);
+				} else {
+					goto out_err;
+				}
+			}
+		} else if (end >= be->be_f_offset + be->be_length) {
+			/* new extent overlaps existing be */
 			if (extents_consistent(be, new)) {
-				dprintk("%s: new is subset, ignoring\n",
-					__func__);
-				put_extent(new);
-				return 0;
-			} else
+				/* extend new to fully replace be */
+				dprintk("%s: removing %p\n", __func__, be);
+				list_del(&be->be_node);
+				put_extent(be);
+			} else {
 				goto out_err;
-		} else if (new->be_f_offset <=
-				be->be_f_offset + be->be_length) {
-			/* new overlaps or abuts existing be */
-			if (extents_consistent(be, new)) {
+			}
+		} else if (end > be->be_f_offset) {
+			/*           |<--   be   -->|
+			 *|<--   new   -->| */
+			if (extents_consistent(new, be)) {
 				/* extend new to fully replace be */
-				new->be_length += new->be_f_offset -
-						  be->be_f_offset;
-				new->be_f_offset = be->be_f_offset;
-				new->be_v_offset = be->be_v_offset;
+				new->be_length += be->be_f_offset + be->be_length -
+					new->be_f_offset - new->be_length;
 				dprintk("%s: removing %p\n", __func__, be);
 				list_del(&be->be_node);
 				put_extent(be);
-			} else if (new->be_f_offset !=
-				   be->be_f_offset + be->be_length)
+			} else {
 				goto out_err;
+			}
 		}
 	}
 	/* Note that if we never hit the above break, be will not point to a
 	 * valid extent.  However, in that case &be->be_node==list.
 	 */
-	list_add_tail(&new->be_node, &be->be_node);
+	list_add(&new->be_node, &be->be_node);
 	dprintk("%s: inserting new\n", __func__);
 	print_elist(list);
-	/* Scan forward for overlaps.  If we find any, extend new and
-	 * remove the overlapped extent.
-	 */
-	be = list_prepare_entry(new, list, be_node);
-	list_for_each_entry_safe_continue(be, tmp, list, be_node) {
-		if (end < be->be_f_offset)
-			break;
-		/* new overlaps or abuts existing be */
-		if (extents_consistent(be, new)) {
-			if (end < be->be_f_offset + be->be_length) {
-				/* extend new to fully cover be */
-				end = be->be_f_offset + be->be_length;
-				new->be_length = end - new->be_f_offset;
-			}
-			dprintk("%s: removing %p\n", __func__, be);
-			list_del(&be->be_node);
-			put_extent(be);
-		} else if (end != be->be_f_offset) {
-			list_del(&new->be_node);
-			goto out_err;
-		}
-	}
-	dprintk("%s: after merging\n", __func__);
-	print_elist(list);
 	/* STUB - The per-list consistency checks have all been done,
 	 * should now check cross-list consistency.
 	 */
@@ -502,6 +545,50 @@ find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
 	return ret;
 }
 
+int
+encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
+			       struct xdr_stream *xdr,
+			       const struct nfs4_layoutcommit_args *arg)
+{
+	struct pnfs_block_short_extent *lce, *save;
+	unsigned int count = 0;
+	struct list_head *ranges = &bl->bl_committing;
+	__be32 *p, *xdr_start;
+
+	dprintk("%s enter\n", __func__);
+	/* BUG - creation of bl_commit is buggy - need to wait for
+	 * entire block to be marked WRITTEN before it can be added.
+	 */
+	spin_lock(&bl->bl_ext_lock);
+	/* Want to adjust for possible truncate */
+	/* We now want to adjust argument range */
+
+	/* XDR encode the ranges found */
+	xdr_start = xdr_reserve_space(xdr, 8);
+	if (!xdr_start)
+		goto out;
+	list_for_each_entry_safe(lce, save, &bl->bl_commit, bse_node) {
+		p = xdr_reserve_space(xdr, 7 * 4 + sizeof(lce->bse_devid.data));
+		if (!p)
+			break;
+		WRITE_DEVID(&lce->bse_devid);
+		WRITE64(lce->bse_f_offset << 9);
+		WRITE64(lce->bse_length << 9);
+		WRITE64(0LL);
+		WRITE32(PNFS_BLOCK_READWRITE_DATA);
+		list_del(&lce->bse_node);
+		list_add_tail(&lce->bse_node, ranges);
+		bl->bl_count--;
+		count++;
+	}
+	xdr_start[0] = cpu_to_be32((xdr->p - xdr_start - 1) * 4);
+	xdr_start[1] = cpu_to_be32(count);
+out:
+	spin_unlock(&bl->bl_ext_lock);
+	dprintk("%s found %u ranges\n", __func__, count);
+	return 0;
+}
+
 /* Helper function to set_to_rw that initialize a new extent */
 static void
 _prep_new_extent(struct pnfs_block_extent *new,
-- 
1.7.4.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 24/34] pnfsblock: cleanup_layoutcommit
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (22 preceding siblings ...)
  2011-06-12 23:44 ` [PATCH 23/34] pnfsblock: encode_layoutcommit Jim Rees
@ 2011-06-12 23:44 ` Jim Rees
  2011-06-12 23:44 ` [PATCH 25/34] pnfsblock: bl_read_pagelist Jim Rees
                   ` (9 subsequent siblings)
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

In the blocklayout driver, two things happen during
layoutcommit/cleanup:
1. The modified extents are encoded.
2. On cleanup, the extents are put back on the layout rw
   extents list, for reads.

In the new system, where the actual XDR encoding is done in
encode_layoutcommit() directly into the xdr buffer, these are
the new commit stages:

1. On setup_layoutcommit, the range is adjusted as before
   and a structure is allocated for communication with
   bl_encode_layoutcommit && bl_cleanup_layoutcommit.
   (The generic layer provides a void-star to hang it on.)

2. bl_encode_layoutcommit is called to do the actual
   encoding directly into xdr. The commit-extent-list is not
   freed and is stored on the above structure.
   FIXME: The code is not yet converted to the new XDR cleanup

3. On cleanup, the commit-extent-list is put back by a call
   to set_to_rw() as before, but with no need to XDR-decode
   the list first. The commit-extent-list is then freed, and
   finally the allocated structure is freed.

[SQUASHME: pnfs: blocklayout: port block layout code]
Signed-off-by: Peng Tao <peng_tao@emc.com>
[pnfsblock: SQUASHME: adjust to API change]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[blocklayout: encode_layoutcommit implementation]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[pnfsblock: fix bug setting up layoutcommit.]
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
[pnfsblock: cleanup_layoutcommit wants a status parameter]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
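Note for readers (not part of the commit): the three-way split that
set_to_rw performs on an invalid extent in the diff below can be
sketched in userspace roughly as follows. enum state, struct piece and
split_invalid() are hypothetical names; the sketch only computes the
ranges and leaves out locking, reference counting and the merge with
neighbouring extents.

```c
#include <assert.h>
#include <stdint.h>

enum state { ST_INVALID, ST_RW };	/* hypothetical names */

struct piece {
	uint64_t start;		/* first sector */
	uint64_t len;		/* length in sectors */
	enum state st;
};

/* Split the invalid extent [be_start, be_start + be_len) around the
 * committed range [offset, offset + length).  Writes up to three
 * pieces to out[] and returns how many were produced.  offset must
 * lie inside the extent. */
static int split_invalid(uint64_t be_start, uint64_t be_len,
			 uint64_t offset, uint64_t length,
			 struct piece out[3])
{
	uint64_t be_end = be_start + be_len;
	uint64_t rw_len;
	int n = 0;

	assert(offset >= be_start && offset < be_end);
	/* The read-write piece never extends past the old extent. */
	rw_len = length < be_end - offset ? length : be_end - offset;

	if (offset != be_start) {	/* leading part stays invalid */
		out[n].start = be_start;
		out[n].len = offset - be_start;
		out[n++].st = ST_INVALID;
	}
	out[n].start = offset;		/* committed part becomes rw */
	out[n].len = rw_len;
	out[n++].st = ST_RW;
	if (offset + rw_len < be_end) {	/* trailing part stays invalid */
		out[n].start = offset + rw_len;
		out[n].len = be_end - (offset + rw_len);
		out[n++].st = ST_INVALID;
	}
	return n;
}
```

When the committed range runs past the end of the extent, only the
part inside it is converted; the caller in the patch then loops,
calling set_to_rw again from the returned end offset.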
 fs/nfs/blocklayout/blocklayout.c |    2 +
 fs/nfs/blocklayout/blocklayout.h |    3 +
 fs/nfs/blocklayout/extents.c     |  209 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 214 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 1c9a5d0..2cc5be7 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -164,6 +164,8 @@ static void
 bl_cleanup_layoutcommit(struct pnfs_layout_hdr *lo,
 			struct nfs4_layoutcommit_data *lcdata)
 {
+	dprintk("%s enter\n", __func__);
+	clean_pnfs_block_layoutupdate(BLK_LO2EXT(lo), &lcdata->args, lcdata->res.status);
 }
 
 static void free_blk_mountid(struct block_mount_id *mid)
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 03d703b..3b3e70a 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -262,6 +262,9 @@ int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect);
 int encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
 				   struct xdr_stream *xdr,
 				   const struct nfs4_layoutcommit_args *arg);
+void clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
+				   const struct nfs4_layoutcommit_args *arg,
+				   int status);
 int add_and_merge_extent(struct pnfs_block_layout *bl,
 			 struct pnfs_block_extent *new);
 
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index e754d32..1447bfc 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -327,6 +327,73 @@ void print_clist(struct list_head *list, unsigned int count)
 	dprintk("****************\n");
 }
 
+/* Note: In theory, we should do more checking that devid's match between
+ * old and new, but if they don't, the lists are too corrupt to salvage anyway.
+ */
+/* Note this is very similar to add_and_merge_extent */
+static void add_to_commitlist(struct pnfs_block_layout *bl,
+			      struct pnfs_block_short_extent *new)
+{
+	struct list_head *clist = &bl->bl_commit;
+	struct pnfs_block_short_extent *old, *save;
+	sector_t end = new->bse_f_offset + new->bse_length;
+
+	dprintk("%s enter\n", __func__);
+	print_short_extent(new);
+	print_clist(clist, bl->bl_count);
+	bl->bl_count++;
+	/* Scan for proper place to insert, extending new to the left
+	 * as much as possible.
+	 */
+	list_for_each_entry_safe(old, save, clist, bse_node) {
+		if (new->bse_f_offset < old->bse_f_offset)
+			break;
+		if (end <= old->bse_f_offset + old->bse_length) {
+			/* Range is already in list */
+			bl->bl_count--;
+			kfree(new);
+			return;
+		} else if (new->bse_f_offset <=
+				old->bse_f_offset + old->bse_length) {
+			/* new overlaps or abuts existing be */
+			if (new->bse_mdev == old->bse_mdev) {
+				/* extend new to fully replace old */
+				new->bse_length += new->bse_f_offset -
+						old->bse_f_offset;
+				new->bse_f_offset = old->bse_f_offset;
+				list_del(&old->bse_node);
+				bl->bl_count--;
+				kfree(old);
+			}
+		}
+	}
+	/* Note that if we never hit the above break, old will not point to a
+	 * valid extent.  However, in that case &old->bse_node==list.
+	 */
+	list_add_tail(&new->bse_node, &old->bse_node);
+	/* Scan forward for overlaps.  If we find any, extend new and
+	 * remove the overlapped extent.
+	 */
+	old = list_prepare_entry(new, clist, bse_node);
+	list_for_each_entry_safe_continue(old, save, clist, bse_node) {
+		if (end < old->bse_f_offset)
+			break;
+		/* new overlaps or abuts old */
+		if (new->bse_mdev == old->bse_mdev) {
+			if (end < old->bse_f_offset + old->bse_length) {
+				/* extend new to fully cover old */
+				end = old->bse_f_offset + old->bse_length;
+				new->bse_length = end - new->bse_f_offset;
+			}
+			list_del(&old->bse_node);
+			bl->bl_count--;
+			kfree(old);
+		}
+	}
+	dprintk("%s: after merging\n", __func__);
+	print_clist(clist, bl->bl_count);
+}
+
 static void print_bl_extent(struct pnfs_block_extent *be)
 {
 	dprintk("PRINT EXTENT extent %p\n", be);
@@ -545,6 +612,34 @@ find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
 	return ret;
 }
 
+/* Similar to find_get_extent, but called with lock held, and ignores cow */
+static struct pnfs_block_extent *
+find_get_extent_locked(struct pnfs_block_layout *bl, sector_t isect)
+{
+	struct pnfs_block_extent *be, *ret = NULL;
+	int i;
+
+	dprintk("%s enter with isect %llu\n", __func__, (u64)isect);
+	for (i = 0; i < EXTENT_LISTS; i++) {
+		if (ret)
+			break;
+		list_for_each_entry_reverse(be, &bl->bl_extents[i], be_node) {
+			if (isect >= be->be_f_offset + be->be_length)
+				break;
+			if (isect >= be->be_f_offset) {
+				/* We have found an extent */
+				dprintk("%s Get %p (%i)\n", __func__, be,
+					atomic_read(&be->be_refcnt.refcount));
+				kref_get(&be->be_refcnt);
+				ret = be;
+				break;
+			}
+		}
+	}
+	print_bl_extent(ret);
+	return ret;
+}
+
 int
 encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
 			       struct xdr_stream *xdr,
@@ -635,3 +730,117 @@ _front_merge(struct pnfs_block_extent *be, struct list_head *head,
 	kfree(storage);
 	return be;
 }
+
+static u64
+set_to_rw(struct pnfs_block_layout *bl, u64 offset, u64 length)
+{
+	u64 rv = offset + length;
+	struct pnfs_block_extent *be, *e1, *e2, *e3, *new, *old;
+	struct pnfs_block_extent *children[3];
+	struct pnfs_block_extent *merge1 = NULL, *merge2 = NULL;
+	int i = 0, j;
+
+	dprintk("%s(%llu, %llu)\n", __func__, offset, length);
+	/* Create storage for up to three new extents e1, e2, e3 */
+	e1 = kmalloc(sizeof(*e1), GFP_KERNEL);
+	e2 = kmalloc(sizeof(*e2), GFP_KERNEL);
+	e3 = kmalloc(sizeof(*e3), GFP_KERNEL);
+	/* BUG - we are ignoring any failure */
+	if (!e1 || !e2 || !e3)
+		goto out_nosplit;
+
+	spin_lock(&bl->bl_ext_lock);
+	be = find_get_extent_locked(bl, offset);
+	rv = be->be_f_offset + be->be_length;
+	if (be->be_state != PNFS_BLOCK_INVALID_DATA) {
+		spin_unlock(&bl->bl_ext_lock);
+		goto out_nosplit;
+	}
+	/* Add e* to children, bumping e*'s krefs */
+	if (be->be_f_offset != offset) {
+		_prep_new_extent(e1, be, be->be_f_offset,
+				 offset - be->be_f_offset,
+				 PNFS_BLOCK_INVALID_DATA);
+		children[i++] = e1;
+		print_bl_extent(e1);
+	} else
+		merge1 = e1;
+	_prep_new_extent(e2, be, offset,
+			 min(length, be->be_f_offset + be->be_length - offset),
+			 PNFS_BLOCK_READWRITE_DATA);
+	children[i++] = e2;
+	print_bl_extent(e2);
+	if (offset + length < be->be_f_offset + be->be_length) {
+		_prep_new_extent(e3, be, e2->be_f_offset + e2->be_length,
+				 be->be_f_offset + be->be_length -
+				 offset - length,
+				 PNFS_BLOCK_INVALID_DATA);
+		children[i++] = e3;
+		print_bl_extent(e3);
+	} else
+		merge2 = e3;
+
+	/* Remove be from list, and insert the e* */
+	/* We don't get refs on e*, since this list is the base reference
+	 * set when init'ed.
+	 */
+	if (i < 3)
+		children[i] = NULL;
+	new = children[0];
+	list_replace(&be->be_node, &new->be_node);
+	put_extent(be);
+	new = _front_merge(new, &bl->bl_extents[RW_EXTENT], merge1);
+	for (j = 1; j < i; j++) {
+		old = new;
+		new = children[j];
+		list_add(&new->be_node, &old->be_node);
+	}
+	if (merge2) {
+		/* This is a HACK, should just create a _back_merge function */
+		new = list_entry(new->be_node.next,
+				 struct pnfs_block_extent, be_node);
+		new = _front_merge(new, &bl->bl_extents[RW_EXTENT], merge2);
+	}
+	spin_unlock(&bl->bl_ext_lock);
+
+	/* Since we removed the base reference above, be is now scheduled for
+	 * destruction.
+	 */
+	put_extent(be);
+	dprintk("%s returns %llu after split\n", __func__, rv);
+	return rv;
+
+ out_nosplit:
+	kfree(e1);
+	kfree(e2);
+	kfree(e3);
+	dprintk("%s returns %llu without splitting\n", __func__, rv);
+	return rv;
+}
+
+void
+clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
+			      const struct nfs4_layoutcommit_args *arg,
+			      int status)
+{
+	struct pnfs_block_short_extent *lce, *save;
+
+	dprintk("%s status %d\n", __func__, status);
+	list_for_each_entry_safe_reverse(lce, save, &bl->bl_committing, bse_node) {
+		if (likely(!status)) {
+			u64 offset = lce->bse_f_offset;
+			u64 end = offset + lce->bse_length;
+
+			do {
+				offset = set_to_rw(bl, offset, end - offset);
+			} while (offset < end);
+			list_del(&lce->bse_node);
+
+			kfree(lce);
+		} else {
+			spin_lock(&bl->bl_ext_lock);
+			add_to_commitlist(bl, lce);
+			spin_unlock(&bl->bl_ext_lock);
+		}
+	}
+}
-- 
1.7.4.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 25/34] pnfsblock: bl_read_pagelist
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (23 preceding siblings ...)
  2011-06-12 23:44 ` [PATCH 24/34] pnfsblock: cleanup_layoutcommit Jim Rees
@ 2011-06-12 23:44 ` Jim Rees
  2011-06-12 23:44 ` [PATCH 26/34] pnfsblock: write_begin Jim Rees
                   ` (8 subsequent siblings)
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Note: When the upper layer's read/write request cannot be fulfilled, the
block layout driver shouldn't silently mark the page as being in error.
It should do what can be done and leave the rest to the upper layer. To
do so, we should set rdata/wdata->res.count properly.

When the upper layer re-sends the read/write request to finish the rest
of the request, pgbase is the position where we should start.

[pnfsblock: read path error handling]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: handle errors when read or write pagelist.]
Signed-off-by: Zhang Jingwang <yyalone@gmail.com>
[pnfs-block: use new read_pagelist api]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
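Note for readers (not part of the commit): the way bl_read_pagelist in
the diff below derives res.count and res.eof from how far it got can
be sketched in userspace roughly as follows. struct read_result and
finish_read() are hypothetical names, and 4 KiB pages are an
assumption made only for the example.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9
#define PAGE_SHIFT_ASSUMED 12	/* assumption: 4 KiB pages */

struct read_result {
	uint64_t count;	/* bytes the driver actually covered */
	int eof;	/* nonzero if the read reached end of file */
};

/* f_offset is the byte offset of the request, pages_done the number
 * of pages processed before stopping, i_size the file size in bytes. */
static struct read_result
finish_read(uint64_t f_offset, unsigned int pages_done, uint64_t i_size)
{
	uint64_t isect;
	uint64_t reached;
	struct read_result res;

	assert(f_offset <= i_size);
	/* Each page advances the sector cursor by PAGE_SIZE >> 9. */
	isect = (f_offset >> SECTOR_SHIFT) +
		((uint64_t)pages_done << (PAGE_SHIFT_ASSUMED - SECTOR_SHIFT));
	reached = isect << SECTOR_SHIFT;

	if (reached >= i_size) {
		res.eof = 1;
		res.count = i_size - f_offset;
	} else {
		res.eof = 0;
		res.count = reached - f_offset;
	}
	return res;
}
```

A short count tells the upper layer how much was done, so that a
re-sent request can pick up at the first page that was not covered.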
 fs/nfs/blocklayout/blocklayout.c |  259 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 259 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 2cc5be7..63fddca 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -31,6 +31,7 @@
  */
 #include <linux/module.h>
 #include <linux/init.h>
+#include <linux/bio.h> /* struct bio */
 #include <linux/vmalloc.h>
 #include "blocklayout.h"
 
@@ -40,9 +41,267 @@ MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Andy Adamson <andros@citi.umich.edu>");
 MODULE_DESCRIPTION("The NFSv4.1 pNFS Block layout driver");
 
+static void print_page(struct page *page)
+{
+	dprintk("PRINTPAGE page %p\n", page);
+	dprintk("        PagePrivate %d\n", PagePrivate(page));
+	dprintk("        PageUptodate %d\n", PageUptodate(page));
+	dprintk("        PageError %d\n", PageError(page));
+	dprintk("        PageDirty %d\n", PageDirty(page));
+	dprintk("        PageReferenced %d\n", PageReferenced(page));
+	dprintk("        PageLocked %d\n", PageLocked(page));
+	dprintk("        PageWriteback %d\n", PageWriteback(page));
+	dprintk("        PageMappedToDisk %d\n", PageMappedToDisk(page));
+	dprintk("\n");
+}
+
+/* Given the be associated with isect, determine if page data needs to be
+ * initialized.
+ */
+static int is_hole(struct pnfs_block_extent *be, sector_t isect)
+{
+	if (be->be_state == PNFS_BLOCK_NONE_DATA)
+		return 1;
+	else if (be->be_state != PNFS_BLOCK_INVALID_DATA)
+		return 0;
+	else
+		return !is_sector_initialized(be->be_inval, isect);
+}
+
+static int
+dont_like_caller(struct nfs_page *req)
+{
+	if (atomic_read(&req->wb_complete)) {
+		/* Called by _multi */
+		return 1;
+	} else {
+		/* Called by _one */
+		return 0;
+	}
+}
+
+/* The data we are handed might be spread across several bios.  We need
+ * to track when the last one is finished.
+ */
+struct parallel_io {
+	struct kref refcnt;
+	struct rpc_call_ops call_ops;
+	void (*pnfs_callback) (void *data);
+	void *data;
+};
+
+static inline struct parallel_io *alloc_parallel(void *data)
+{
+	struct parallel_io *rv;
+
+	rv  = kmalloc(sizeof(*rv), GFP_KERNEL);
+	if (rv) {
+		rv->data = data;
+		kref_init(&rv->refcnt);
+	}
+	return rv;
+}
+
+static inline void get_parallel(struct parallel_io *p)
+{
+	kref_get(&p->refcnt);
+}
+
+static void destroy_parallel(struct kref *kref)
+{
+	struct parallel_io *p = container_of(kref, struct parallel_io, refcnt);
+
+	dprintk("%s enter\n", __func__);
+	p->pnfs_callback(p->data);
+	kfree(p);
+}
+
+static inline void put_parallel(struct parallel_io *p)
+{
+	kref_put(&p->refcnt, destroy_parallel);
+}
+
+static struct bio *
+bl_submit_bio(int rw, struct bio *bio)
+{
+	if (bio) {
+		get_parallel(bio->bi_private);
+		dprintk("%s submitting %s bio %u@%llu\n", __func__,
+			rw == READ ? "read" : "write",
+			bio->bi_size, (u64)bio->bi_sector);
+		submit_bio(rw, bio);
+	}
+	return NULL;
+}
+
+static inline void
+bl_done_with_rpage(struct page *page, const int ok)
+{
+	if (ok) {
+		ClearPagePnfsErr(page);
+		SetPageUptodate(page);
+	} else {
+		ClearPageUptodate(page);
+		SetPageError(page);
+		SetPagePnfsErr(page);
+	}
+	/* Page is unlocked via rpc_release.  Should really be done here. */
+}
+
+/* This is basically copied from mpage_end_io_read */
+static void bl_end_io_read(struct bio *bio, int err)
+{
+	void *data = bio->bi_private;
+	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
+
+	do {
+		struct page *page = bvec->bv_page;
+
+		if (--bvec >= bio->bi_io_vec)
+			prefetchw(&bvec->bv_page->flags);
+		bl_done_with_rpage(page, uptodate);
+	} while (bvec >= bio->bi_io_vec);
+	bio_put(bio);
+	put_parallel(data);
+}
+
+static void bl_read_cleanup(struct work_struct *work)
+{
+	struct rpc_task *task;
+	struct nfs_read_data *rdata;
+	dprintk("%s enter\n", __func__);
+	task = container_of(work, struct rpc_task, u.tk_work);
+	rdata = container_of(task, struct nfs_read_data, task);
+	pnfs_ld_read_done(rdata);
+}
+
+static void
+bl_end_par_io_read(void *data)
+{
+	struct nfs_read_data *rdata = data;
+
+	INIT_WORK(&rdata->task.u.tk_work, bl_read_cleanup);
+	schedule_work(&rdata->task.u.tk_work);
+}
+
+/* We don't want normal .rpc_call_done callback used, so we replace it
+ * with this stub.
+ */
+static void bl_rpc_do_nothing(struct rpc_task *task, void *calldata)
+{
+	return;
+}
+
 static enum pnfs_try_status
 bl_read_pagelist(struct nfs_read_data *rdata)
 {
+	int i, hole;
+	struct bio *bio = NULL;
+	struct pnfs_block_extent *be = NULL, *cow_read = NULL;
+	sector_t isect, extent_length = 0;
+	struct parallel_io *par;
+	loff_t f_offset = rdata->args.offset;
+	size_t count = rdata->args.count;
+	struct page **pages = rdata->args.pages;
+	int pg_index = rdata->args.pgbase >> PAGE_CACHE_SHIFT;
+
+	dprintk("%s enter nr_pages %u offset %lld count %Zd\n", __func__,
+	       rdata->npages, f_offset, count);
+
+	if (dont_like_caller(rdata->req)) {
+		dprintk("%s dont_like_caller failed\n", __func__);
+		goto use_mds;
+	}
+	if ((rdata->npages == 1) && PagePnfsErr(rdata->req->wb_page)) {
+		/* We want to fall back to mds in case of read_page
+		 * after error on read_pages.
+		 */
+		dprintk("%s PG_pnfserr set\n", __func__);
+		goto use_mds;
+	}
+	par = alloc_parallel(rdata);
+	if (!par)
+		goto use_mds;
+	par->call_ops = *rdata->mds_ops;
+	par->call_ops.rpc_call_done = bl_rpc_do_nothing;
+	par->pnfs_callback = bl_end_par_io_read;
+	/* At this point, we can no longer jump to use_mds */
+
+	isect = (sector_t) (f_offset >> 9);
+	/* Code assumes extents are page-aligned */
+	for (i = pg_index; i < rdata->npages; i++) {
+		if (!extent_length) {
+			/* We've used up the previous extent */
+			put_extent(be);
+			put_extent(cow_read);
+			bio = bl_submit_bio(READ, bio);
+			/* Get the next one */
+			be = find_get_extent(BLK_LSEG2EXT(rdata->lseg),
+					     isect, &cow_read);
+			if (!be) {
+				/* Error out this page */
+				bl_done_with_rpage(pages[i], 0);
+				break;
+			}
+			extent_length = be->be_length -
+				(isect - be->be_f_offset);
+			if (cow_read) {
+				sector_t cow_length = cow_read->be_length -
+					(isect - cow_read->be_f_offset);
+				extent_length = min(extent_length, cow_length);
+			}
+		}
+		hole = is_hole(be, isect);
+		if (hole && !cow_read) {
+			bio = bl_submit_bio(READ, bio);
+			/* Fill hole w/ zeroes w/o accessing device */
+			dprintk("%s Zeroing page for hole\n", __func__);
+			zero_user(pages[i], 0,
+				  min_t(int, PAGE_CACHE_SIZE, count));
+			print_page(pages[i]);
+			bl_done_with_rpage(pages[i], 1);
+		} else {
+			struct pnfs_block_extent *be_read;
+
+			be_read = (hole && cow_read) ? cow_read : be;
+			for (;;) {
+				if (!bio) {
+					bio = bio_alloc(GFP_NOIO, rdata->npages - i);
+					if (!bio) {
+						/* Error out this page */
+						bl_done_with_rpage(pages[i], 0);
+						break;
+					}
+					bio->bi_sector = isect -
+						be_read->be_f_offset +
+						be_read->be_v_offset;
+					bio->bi_bdev = be_read->be_mdev;
+					bio->bi_end_io = bl_end_io_read;
+					bio->bi_private = par;
+				}
+				if (bio_add_page(bio, pages[i], PAGE_SIZE, 0))
+					break;
+				bio = bl_submit_bio(READ, bio);
+			}
+		}
+		isect += PAGE_CACHE_SIZE >> 9;
+		extent_length -= PAGE_CACHE_SIZE >> 9;
+	}
+	if ((isect << 9) >= rdata->inode->i_size) {
+		rdata->res.eof = 1;
+		rdata->res.count = rdata->inode->i_size - f_offset;
+	} else {
+		rdata->res.count = (isect << 9) - f_offset;
+	}
+	put_extent(be);
+	put_extent(cow_read);
+	bl_submit_bio(READ, bio);
+	put_parallel(par);
+	return PNFS_ATTEMPTED;
+
+ use_mds:
+	dprintk("Giving up and using normal NFS\n");
 	return PNFS_NOT_ATTEMPTED;
 }
 
-- 
1.7.4.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 26/34] pnfsblock: write_begin
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (24 preceding siblings ...)
  2011-06-12 23:44 ` [PATCH 25/34] pnfsblock: bl_read_pagelist Jim Rees
@ 2011-06-12 23:44 ` Jim Rees
  2011-06-12 23:44 ` [PATCH 27/34] pnfsblock: write_end Jim Rees
                   ` (7 subsequent siblings)
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Implements bl_write_begin and bl_do_flush, allowing the block driver to
read in the parts of a page "around" the data that is about to be copied
to the page.

[pnfsblock: fix 64-bit compiler warnings for write_begin]
[pnfsblock: write_begin adjust for removed fields]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
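Note for readers (not part of the commit): the sector-to-block
conversion done by map_block in the diff below can be sketched in
userspace roughly as follows. sector_to_block() is a hypothetical
name; blkbits stands in for be_mdev->bd_inode->i_blkbits.

```c
#include <assert.h>
#include <stdint.h>

/* Translate a file sector (512-byte units) inside an extent into a
 * block number on the backing device.  f_offset and v_offset are the
 * extent's file and volume start sectors, blkbits is log2 of the
 * device block size. */
static uint64_t sector_to_block(uint64_t isect, uint64_t f_offset,
				uint64_t v_offset, unsigned int blkbits)
{
	assert(blkbits >= 9);
	assert(isect >= f_offset);
	return (isect - f_offset + v_offset) >> (blkbits - 9);
}
```

With 4096-byte blocks (blkbits of 12) eight sectors share one block,
so the volume sector is shifted right by three.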
 fs/nfs/blocklayout/blocklayout.c |  178 +++++++++++++++++++++++++++++++++++++-
 1 files changed, 177 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 63fddca..5d7cb86 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -31,6 +31,8 @@
  */
 #include <linux/module.h>
 #include <linux/init.h>
+
+#include <linux/buffer_head.h> /* various write calls */
 #include <linux/bio.h> /* struct bio */
 #include <linux/vmalloc.h>
 #include "blocklayout.h"
@@ -592,11 +594,185 @@ bl_clear_layoutdriver(struct nfs_server *server)
 	return 0;
 }
 
+/* STUB - mark intersection of layout and page as bad, so it is not
+ * used again.
+ */
+static void mark_bad_read(void)
+{
+	return;
+}
+
+/* Copied from buffer.c */
+static void __end_buffer_read_notouch(struct buffer_head *bh, int uptodate)
+{
+	if (uptodate) {
+		set_buffer_uptodate(bh);
+	} else {
+		/* This happens, due to failed READA attempts. */
+		clear_buffer_uptodate(bh);
+	}
+	unlock_buffer(bh);
+}
+
+/* Copied from buffer.c */
+static void end_buffer_read_nobh(struct buffer_head *bh, int uptodate)
+{
+	__end_buffer_read_notouch(bh, uptodate);
+}
+
+/*
+ * map_block:  map a requested I/O block (isect) into an offset in the LVM
+ * meta block_device
+ */
+static void
+map_block(sector_t isect, struct pnfs_block_extent *be, struct buffer_head *bh)
+{
+	dprintk("%s enter be=%p\n", __func__, be);
+
+	set_buffer_mapped(bh);
+	bh->b_bdev = be->be_mdev;
+	bh->b_blocknr = (isect - be->be_f_offset + be->be_v_offset) >>
+		(be->be_mdev->bd_inode->i_blkbits - 9);
+
+	dprintk("%s isect %ld, bh->b_blocknr %ld, using bsize %Zd\n",
+				__func__, (long)isect,
+				(long)bh->b_blocknr,
+				bh->b_size);
+	return;
+}
+
+/* Given an unmapped page, zero it (or read in page for COW),
+ * and set appropriate flags/markings, but it is safe to not initialize
+ * the range given in [from, to).
+ */
+/* This is loosely based on nobh_write_begin */
+static int
+init_page_for_write(struct pnfs_block_layout *bl, struct page *page,
+		    unsigned from, unsigned to, sector_t **pages_to_mark)
+{
+	struct buffer_head *bh;
+	int inval, ret = -EIO;
+	struct pnfs_block_extent *be = NULL, *cow_read = NULL;
+	sector_t isect;
+
+	dprintk("%s enter, %p\n", __func__, page);
+	bh = alloc_page_buffers(page, PAGE_CACHE_SIZE, 0);
+	if (!bh) {
+		ret = -ENOMEM;
+		goto cleanup;
+	}
+
+	isect = (sector_t)page->index << (PAGE_CACHE_SHIFT - 9);
+	be = find_get_extent(bl, isect, &cow_read);
+	if (!be)
+		goto cleanup;
+	inval = is_hole(be, isect);
+	dprintk("%s inval=%i, from=%u, to=%u\n", __func__, inval, from, to);
+	if (inval) {
+		if (be->be_state == PNFS_BLOCK_NONE_DATA) {
+			dprintk("%s PANIC - got NONE_DATA extent %p\n",
+				__func__, be);
+			goto cleanup;
+		}
+		map_block(isect, be, bh);
+		unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
+	}
+	if (PageUptodate(page)) {
+		/* Do nothing */
+	} else if (inval && !cow_read) {
+		zero_user_segments(page, 0, from, to, PAGE_CACHE_SIZE);
+	} else if (0 < from || PAGE_CACHE_SIZE > to) {
+		struct pnfs_block_extent *read_extent;
+
+		read_extent = (inval && cow_read) ? cow_read : be;
+		map_block(isect, read_extent, bh);
+		lock_buffer(bh);
+		bh->b_end_io = end_buffer_read_nobh;
+		submit_bh(READ, bh);
+		dprintk("%s: Waiting for buffer read\n", __func__);
+		/* XXX Don't really want to hold layout lock here */
+		wait_on_buffer(bh);
+		if (!buffer_uptodate(bh))
+			goto cleanup;
+	}
+	if (be->be_state == PNFS_BLOCK_INVALID_DATA) {
+		/* There is a BUG here on a short copy after write_begin,
+		 * but I think this is a generic fs bug.  The problem is that
+		 * we have marked the page as initialized, but it is possible
+		 * that the section not copied may never get copied.
+		 */
+		ret = mark_initialized_sectors(be->be_inval, isect,
+					       PAGE_CACHE_SECTORS,
+					       pages_to_mark);
+		/* Want to preallocate mem so above can't fail */
+		if (ret)
+			goto cleanup;
+	}
+	SetPageMappedToDisk(page);
+	ret = 0;
+
+cleanup:
+	free_buffer_head(bh);
+	put_extent(be);
+	put_extent(cow_read);
+	if (ret) {
+		/* Need to mark layout with bad read...should now
+		 * just use nfs4 for reads and writes.
+		 */
+		mark_bad_read();
+	}
+	return ret;
+}
+
 static int
 bl_write_begin(struct pnfs_layout_segment *lseg, struct page *page, loff_t pos,
 	       unsigned count, struct pnfs_fsdata *fsdata)
 {
-	return 0;
+	unsigned from, to;
+	int ret;
+	sector_t *pages_to_mark = NULL;
+	struct pnfs_block_layout *bl = BLK_LSEG2EXT(lseg);
+
+	dprintk("%s enter, %u@%lld\n", __func__, count, pos);
+	print_page(page);
+	/* The following code assumes blocksize >= PAGE_CACHE_SIZE */
+	if (bl->bl_blocksize < (PAGE_CACHE_SIZE >> 9)) {
+		dprintk("%s Can't handle blocksize %llu\n", __func__,
+			(u64)bl->bl_blocksize);
+		put_lseg(fsdata->lseg);
+		fsdata->lseg = NULL;
+		return 0;
+	}
+	if (PageMappedToDisk(page)) {
+		/* Basically, this is a flag that says we have
+		 * successfully called write_begin already on this page.
+		 */
+		/* NOTE - there are cache consistency issues here.
+		 * For example, what if the layout is recalled, then regained?
+		 * If the file is closed and reopened, will the page flags
+		 * be reset?  If not, we'll have to use layout info instead of
+		 * the page flag.
+		 */
+		return 0;
+	}
+	from = pos & (PAGE_CACHE_SIZE - 1);
+	to = from + count;
+	ret = init_page_for_write(bl, page, from, to, &pages_to_mark);
+	if (ret) {
+		dprintk("%s init page failed with %i\n", __func__, ret);
+		/* Revert back to plain NFS and just continue on with
+		 * write.  This assumes there is no request attached, which
+		 * should be true if we get here.
+		 */
+		BUG_ON(PagePrivate(page));
+		put_lseg(fsdata->lseg);
+		fsdata->lseg = NULL;
+		kfree(pages_to_mark);
+		ret = 0;
+	} else {
+		fsdata->private = pages_to_mark;
+	}
+	return ret;
 }
 
 static int
-- 
1.7.4.1
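
The sector arithmetic used by init_page_for_write, bl_write_begin and
map_block above can be tried out in userspace.  The following is a minimal
sketch, not code from the patch: it assumes 4 KiB pages, and struct
sketch_extent and every sketch_* name are hypothetical stand-ins for the
kernel types and helpers.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT   12u	/* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE    (1u << SKETCH_PAGE_SHIFT)
#define SKETCH_SECTOR_SHIFT 9u	/* 512-byte sectors, as in the patch */

/* Hypothetical stand-in for the two extent fields map_block uses. */
struct sketch_extent {
	uint64_t f_offset;	/* first file sector covered by the extent */
	uint64_t v_offset;	/* matching sector on the volume */
};

/* First 512-byte sector of a page (isect in init_page_for_write). */
static uint64_t sketch_page_to_sector(uint64_t page_index)
{
	return page_index << (SKETCH_PAGE_SHIFT - SKETCH_SECTOR_SHIFT);
}

/* Byte offset of pos inside its page ("from" in bl_write_begin);
 * "to" is then from + count. */
static unsigned sketch_offset_in_page(uint64_t pos)
{
	return (unsigned)(pos & (SKETCH_PAGE_SIZE - 1));
}

/* Device block number for a file sector (b_blocknr in map_block);
 * blkbits is log2 of the device block size in bytes. */
static uint64_t sketch_sector_to_blocknr(uint64_t isect,
					 const struct sketch_extent *be,
					 unsigned blkbits)
{
	assert(blkbits >= SKETCH_SECTOR_SHIFT);
	assert(isect >= be->f_offset);
	return (isect - be->f_offset + be->v_offset) >>
		(blkbits - SKETCH_SECTOR_SHIFT);
}
```

With these assumptions, page index 3 starts at sector 24, and a 50-byte
write at file position 8292 covers bytes [100, 150) of its page.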



* [PATCH 27/34] pnfsblock: write_end
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (25 preceding siblings ...)
  2011-06-12 23:44 ` [PATCH 26/34] pnfsblock: write_begin Jim Rees
@ 2011-06-12 23:44 ` Jim Rees
  2011-06-12 23:44 ` [PATCH 28/34] pnfsblock: write_end_cleanup Jim Rees
                   ` (6 subsequent siblings)
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Implements bl_write_end, which basically just calls SetPageUptodate.

[pnfsblock: write_end adjust for removed ok_to_use_pnfs]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
 fs/nfs/blocklayout/blocklayout.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 5d7cb86..8914143 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -775,10 +775,15 @@ bl_write_begin(struct pnfs_layout_segment *lseg, struct page *page, loff_t pos,
 	return ret;
 }
 
+/* CAREFUL - what happens if copied < count??? */
 static int
 bl_write_end(struct inode *inode, struct page *page, loff_t pos,
 	     unsigned count, unsigned copied, struct pnfs_layout_segment *lseg)
 {
+	dprintk("%s enter, %u@%lld, lseg=%p\n", __func__, count, pos, lseg);
+	print_page(page);
+	if (lseg)
+		SetPageUptodate(page);
 	return 0;
 }
 
-- 
1.7.4.1



* [PATCH 28/34] pnfsblock: write_end_cleanup
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (26 preceding siblings ...)
  2011-06-12 23:44 ` [PATCH 27/34] pnfsblock: write_end Jim Rees
@ 2011-06-12 23:44 ` Jim Rees
  2011-06-12 23:45 ` [PATCH 29/34] pnfsblock: bl_write_pagelist support functions Jim Rees
                   ` (5 subsequent siblings)
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:44 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Ensure all pages in block are marked for initialization if needed.

[pnfsblock: Update to 2.6.29]
[pnfsblock: write_end_cleanup adjust for removed ok_to_use_pnfs]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
 fs/nfs/blocklayout/blocklayout.c |   54 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 54 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 8914143..a40616b 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -794,6 +794,60 @@ bl_write_end(struct inode *inode, struct page *page, loff_t pos,
 static void
 bl_write_end_cleanup(struct file *filp, struct pnfs_fsdata *fsdata)
 {
+	struct page *page;
+	pgoff_t index;
+	sector_t *pos;
+	struct address_space *mapping = filp->f_mapping;
+	struct pnfs_fsdata *fake_data;
+	struct pnfs_layout_segment *lseg;
+
+	if (!fsdata)
+		return;
+	lseg = fsdata->lseg;
+	if (!lseg)
+		return;
+	pos = fsdata->private;
+	if (!pos)
+		return;
+	dprintk("%s enter with pos=%llu\n", __func__, (u64)(*pos));
+	for (; *pos != ~0; pos++) {
+		index = *pos >> (PAGE_CACHE_SHIFT - 9);
+		/* XXX How do we properly deal with failures here??? */
+		page = grab_cache_page_write_begin(mapping, index, 0);
+		if (!page) {
+			printk(KERN_ERR "%s BUG BUG BUG NoMem\n", __func__);
+			continue;
+		}
+		dprintk("%s: Examining block page\n", __func__);
+		print_page(page);
+		if (!PageMappedToDisk(page)) {
+			/* XXX How do we properly deal with failures here??? */
+			dprintk("%s Marking block page\n", __func__);
+			init_page_for_write(BLK_LSEG2EXT(fsdata->lseg), page,
+					    PAGE_CACHE_SIZE, PAGE_CACHE_SIZE,
+					    NULL);
+			print_page(page);
+			fake_data = kzalloc(sizeof(*fake_data), GFP_KERNEL);
+			if (!fake_data) {
+				printk(KERN_ERR "%s BUG BUG BUG NoMem\n",
+				       __func__);
+				unlock_page(page);
+				continue;
+			}
+			get_lseg(lseg);
+			fake_data->lseg = lseg;
+			fake_data->bypass_eof = 1;
+			mapping->a_ops->write_end(filp, mapping,
+						  index << PAGE_CACHE_SHIFT,
+						  PAGE_CACHE_SIZE,
+						  PAGE_CACHE_SIZE,
+						  page, fake_data);
+			/* Note fake_data is freed by nfs_write_end */
+		} else
+			unlock_page(page);
+	}
+	kfree(fsdata->private);
+	fsdata->private = NULL;
 }
 
 static struct pnfs_layoutdriver_type blocklayout_type = {
-- 
1.7.4.1



* [PATCH 29/34] pnfsblock: bl_write_pagelist support functions
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (27 preceding siblings ...)
  2011-06-12 23:44 ` [PATCH 28/34] pnfsblock: write_end_cleanup Jim Rees
@ 2011-06-12 23:45 ` Jim Rees
  2011-06-12 23:45 ` [PATCH 30/34] pnfsblock: bl_write_pagelist Jim Rees
                   ` (4 subsequent siblings)
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:45 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

[pnfsblock: SQUASHME: adjust to API change]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: fixup blksize alignment in bl_setup_layoutcommit]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
 fs/nfs/blocklayout/blocklayout.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index a40616b..9361e5b 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -70,6 +70,19 @@ static int is_hole(struct pnfs_block_extent *be, sector_t isect)
 		return !is_sector_initialized(be->be_inval, isect);
 }
 
+/* Given the be associated with isect, determine if page data can be
+ * written to disk.
+ */
+static int is_writable(struct pnfs_block_extent *be, sector_t isect)
+{
+	if (be->be_state == PNFS_BLOCK_READWRITE_DATA)
+		return 1;
+	else if (be->be_state != PNFS_BLOCK_INVALID_DATA)
+		return 0;
+	else
+		return is_sector_initialized(be->be_inval, isect);
+}
+
 static int
 dont_like_caller(struct nfs_page *req)
 {
-- 
1.7.4.1



* [PATCH 30/34] pnfsblock: bl_write_pagelist
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (28 preceding siblings ...)
  2011-06-12 23:45 ` [PATCH 29/34] pnfsblock: bl_write_pagelist support functions Jim Rees
@ 2011-06-12 23:45 ` Jim Rees
  2011-06-12 23:45 ` [PATCH 31/34] pnfsblock: note written INVAL areas for layoutcommit Jim Rees
                   ` (3 subsequent siblings)
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:45 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Note: When the upper layer's read/write request cannot be fulfilled, the
block layout driver shouldn't silently mark the page as error. It should do
what it can and leave the rest to the upper layer. To do so, we should set
rdata/wdata->res.count properly.

When the upper layer resends the read/write request to finish the rest of
the request, pgbase is the position at which we should start.

[pnfsblock: bl_write_pagelist adjust for missing PG_USE_PNFS]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: handle errors when read or write pagelist.]
Signed-off-by: Zhang Jingwang <yyalone@gmail.com>
[pnfs-block: use new write_pagelist api]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
 fs/nfs/blocklayout/blocklayout.c |  147 +++++++++++++++++++++++++++++++++++++-
 1 files changed, 146 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 9361e5b..03ce7c3 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -320,11 +320,156 @@ bl_read_pagelist(struct nfs_read_data *rdata)
 	return PNFS_NOT_ATTEMPTED;
 }
 
+/* STUB - this needs thought */
+static inline void
+bl_done_with_wpage(struct page *page, const int ok)
+{
+	if (!ok) {
+		SetPageError(page);
+		SetPagePnfsErr(page);
+		/* This is an inline copy of nfs_zap_mapping */
+		/* This is oh so fishy, and needs deep thought */
+		if (page->mapping->nrpages != 0) {
+			struct inode *inode = page->mapping->host;
+			spin_lock(&inode->i_lock);
+			NFS_I(inode)->cache_validity |= NFS_INO_INVALID_DATA;
+			spin_unlock(&inode->i_lock);
+		}
+	}
+	/* end_page_writeback called in rpc_release.  Should be done here. */
+}
+
+/* This is basically copied from mpage_end_io_read */
+static void bl_end_io_write(struct bio *bio, int err)
+{
+	void *data = bio->bi_private;
+	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
+
+	do {
+		struct page *page = bvec->bv_page;
+
+		if (--bvec >= bio->bi_io_vec)
+			prefetchw(&bvec->bv_page->flags);
+		bl_done_with_wpage(page, uptodate);
+	} while (bvec >= bio->bi_io_vec);
+	bio_put(bio);
+	put_parallel(data);
+}
+
+/* Function scheduled for call during bl_end_par_io_write,
+ * it marks sectors as written and extends the commitlist.
+ */
+static void bl_write_cleanup(struct work_struct *work)
+{
+	struct rpc_task *task;
+	struct nfs_write_data *wdata;
+	dprintk("%s enter\n", __func__);
+	task = container_of(work, struct rpc_task, u.tk_work);
+	wdata = container_of(task, struct nfs_write_data, task);
+	pnfs_ld_write_done(wdata);
+}
+
+/* Called when last of bios associated with a bl_write_pagelist call finishes */
+static void
+bl_end_par_io_write(void *data)
+{
+	struct nfs_write_data *wdata = data;
+
+	/* STUB - ignoring error handling */
+	wdata->task.tk_status = 0;
+	wdata->verf.committed = NFS_FILE_SYNC;
+	INIT_WORK(&wdata->task.u.tk_work, bl_write_cleanup);
+	schedule_work(&wdata->task.u.tk_work);
+}
+
 static enum pnfs_try_status
 bl_write_pagelist(struct nfs_write_data *wdata,
 		  int sync)
 {
-	return PNFS_NOT_ATTEMPTED;
+	int i;
+	struct bio *bio = NULL;
+	struct pnfs_block_extent *be = NULL;
+	sector_t isect, extent_length = 0;
+	struct parallel_io *par;
+	loff_t offset = wdata->args.offset;
+	size_t count = wdata->args.count;
+	struct page **pages = wdata->args.pages;
+	int pg_index = wdata->args.pgbase >> PAGE_CACHE_SHIFT;
+
+	dprintk("%s enter, %Zu@%lld\n", __func__, count, offset);
+	if (!wdata->lseg) {
+		dprintk("%s no lseg, falling back to MDS\n", __func__);
+		return PNFS_NOT_ATTEMPTED;
+	}
+	if (dont_like_caller(wdata->req)) {
+		dprintk("%s dont_like_caller failed\n", __func__);
+		return PNFS_NOT_ATTEMPTED;
+	}
+	/* At this point, wdata->pages is a (sequential) list of nfs_pages.
+	 * We want to write each, and if there is an error, remove it from
+	 * the list and call nfs_retry_request(req) to have it redone
+	 * using NFS.
+	 * QUESTION: do this per block or per req?  I think it has to be
+	 * done per block, as part of end_bio.
+	 */
+	par = alloc_parallel(wdata);
+	if (!par)
+		return PNFS_NOT_ATTEMPTED;
+	par->call_ops = *wdata->mds_ops;
+	par->call_ops.rpc_call_done = bl_rpc_do_nothing;
+	par->pnfs_callback = bl_end_par_io_write;
+	/* At this point, have to be more careful with error handling */
+
+	isect = (sector_t) ((offset & (long)PAGE_CACHE_MASK) >> 9);
+	for (i = pg_index; i < wdata->npages ; i++) {
+		if (!extent_length) {
+			/* We've used up the previous extent */
+			put_extent(be);
+			bio = bl_submit_bio(WRITE, bio);
+			/* Get the next one */
+			be = find_get_extent(BLK_LSEG2EXT(wdata->lseg),
+					     isect, NULL);
+			if (!be || !is_writable(be, isect)) {
+				/* FIXME */
+				bl_done_with_wpage(pages[i], 0);
+				break;
+			}
+			extent_length = be->be_length -
+				(isect - be->be_f_offset);
+		}
+		for (;;) {
+			if (!bio) {
+				bio = bio_alloc(GFP_NOIO, wdata->npages - i);
+				if (!bio) {
+					/* Error out this page */
+					/* FIXME */
+					bl_done_with_wpage(pages[i], 0);
+					break;
+				}
+				bio->bi_sector = isect - be->be_f_offset +
+					be->be_v_offset;
+				bio->bi_bdev = be->be_mdev;
+				bio->bi_end_io = bl_end_io_write;
+				bio->bi_private = par;
+			}
+			if (bio_add_page(bio, pages[i], PAGE_SIZE, 0))
+				break;
+			bio = bl_submit_bio(WRITE, bio);
+		}
+		isect += PAGE_CACHE_SIZE >> 9;
+		extent_length -= PAGE_CACHE_SIZE >> 9;
+	}
+	wdata->res.count = (isect << 9) - (offset);
+	if (count < wdata->res.count) {
+		wdata->res.count = count;
+	}
+	/* pnfs_set_layoutcommit needs this */
+	wdata->mds_offset = offset;
+	put_extent(be);
+	bl_submit_bio(WRITE, bio);
+	put_parallel(par);
+	return PNFS_ATTEMPTED;
 }
 
 /* FIXME - range ignored */
-- 
1.7.4.1



* [PATCH 31/34] pnfsblock: note written INVAL areas for layoutcommit
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (29 preceding siblings ...)
  2011-06-12 23:45 ` [PATCH 30/34] pnfsblock: bl_write_pagelist Jim Rees
@ 2011-06-12 23:45 ` Jim Rees
  2011-06-12 23:45 ` [PATCH 32/34] pnfsblock: Implement release_inval_marks Jim Rees
                   ` (2 subsequent siblings)
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:45 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

[SQUASHME: pnfs: blocklayout: port block layout code]
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
 fs/nfs/blocklayout/blocklayout.c |   32 +++++++++++++
 fs/nfs/blocklayout/blocklayout.h |    2 +
 fs/nfs/blocklayout/extents.c     |   95 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 129 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 03ce7c3..242c232 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -320,6 +320,30 @@ bl_read_pagelist(struct nfs_read_data *rdata)
 	return PNFS_NOT_ATTEMPTED;
 }
 
+static void mark_extents_written(struct pnfs_block_layout *bl,
+				 __u64 offset, __u32 count)
+{
+	sector_t isect, end;
+	struct pnfs_block_extent *be;
+
+	dprintk("%s(%llu, %u)\n", __func__, offset, count);
+	if (count == 0)
+		return;
+	isect = (offset & (long)(PAGE_CACHE_MASK)) >> 9;
+	end = (offset + count + PAGE_CACHE_SIZE - 1) & (long)(PAGE_CACHE_MASK);
+	end >>= 9;
+	while (isect < end) {
+		sector_t len;
+		be = find_get_extent(bl, isect, NULL);
+		BUG_ON(!be); /* FIXME */
+		len = min(end, be->be_f_offset + be->be_length) - isect;
+		if (be->be_state == PNFS_BLOCK_INVALID_DATA)
+			mark_for_commit(be, isect, len); /* What if fails? */
+		isect += len;
+		put_extent(be);
+	}
+}
+
 /* STUB - this needs thought */
 static inline void
 bl_done_with_wpage(struct page *page, const int ok)
@@ -367,6 +391,14 @@ static void bl_write_cleanup(struct work_struct *work)
 	dprintk("%s enter\n", __func__);
 	task = container_of(work, struct rpc_task, u.tk_work);
 	wdata = container_of(task, struct nfs_write_data, task);
+	if (!wdata->task.tk_status) {
+		/* Marks for LAYOUTCOMMIT */
+		/* BUG - this should be called after each bio, not after
+		 * all finish, unless have some way of storing success/failure
+		 */
+		mark_extents_written(BLK_LSEG2EXT(wdata->lseg),
+				     wdata->args.offset, wdata->args.count);
+	}
 	pnfs_ld_write_done(wdata);
 }
 
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 3b3e70a..163125c 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -267,6 +267,8 @@ void clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
 				   int status);
 int add_and_merge_extent(struct pnfs_block_layout *bl,
 			 struct pnfs_block_extent *new);
+int mark_for_commit(struct pnfs_block_extent *be,
+		    sector_t offset, sector_t length);
 
 #include <linux/sunrpc/simple_rpc_pipefs.h>
 
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 1447bfc..a62d29f 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -217,6 +217,48 @@ int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect)
 	return rv;
 }
 
+/* Assume start, end already sector aligned */
+static int
+_range_has_tag(struct my_tree_t *tree, u64 start, u64 end, int32_t tag)
+{
+	struct pnfs_inval_tracking *pos;
+	u64 expect = 0;
+
+	dprintk("%s(%llu, %llu, %i) enter\n", __func__, start, end, tag);
+	list_for_each_entry_reverse(pos, &tree->mtt_stub, it_link) {
+		if (pos->it_sector >= end)
+			continue;
+		if (!expect) {
+			if ((pos->it_sector == end - tree->mtt_step_size) &&
+			    (pos->it_tags & (1 << tag))) {
+				expect = pos->it_sector - tree->mtt_step_size;
+				if (pos->it_sector < tree->mtt_step_size || expect < start)
+					return 1;
+				continue;
+			} else {
+				return 0;
+			}
+		}
+		if (pos->it_sector != expect || !(pos->it_tags & (1 << tag)))
+			return 0;
+		expect -= tree->mtt_step_size;
+		if (expect < start)
+			return 1;
+	}
+	return 0;
+}
+
+static int is_range_written(struct pnfs_inval_markings *marks,
+			    sector_t start, sector_t end)
+{
+	int rv;
+
+	spin_lock(&marks->im_lock);
+	rv = _range_has_tag(&marks->im_tree, start, end, EXTENT_WRITTEN);
+	spin_unlock(&marks->im_lock);
+	return rv;
+}
+
 /* Marks sectors in [offest, offset_length) as having been initialized.
  * All lengths are step-aligned, where step is min(pagesize, blocksize).
  * Notes where partial block is initialized, and helps prepare it for
@@ -394,6 +436,59 @@ static void add_to_commitlist(struct pnfs_block_layout *bl,
 	print_clist(clist, bl->bl_count);
 }
 
+/* Note the range described by offset, length is guaranteed to be contained
+ * within be.
+ */
+int mark_for_commit(struct pnfs_block_extent *be,
+		    sector_t offset, sector_t length)
+{
+	sector_t new_end, end = offset + length;
+	struct pnfs_block_short_extent *new;
+	struct pnfs_block_layout *bl = container_of(be->be_inval,
+						    struct pnfs_block_layout,
+						    bl_inval);
+
+	new = kmalloc(sizeof(*new), GFP_KERNEL);
+	if (!new)
+		return -ENOMEM;
+
+	mark_written_sectors(be->be_inval, offset, length);
+	/* We want to add the range to commit list, but it must be
+	 * block-normalized, and verified that the normalized range has
+	 * been entirely written to disk.
+	 */
+	new->bse_f_offset = offset;
+	offset = normalize(offset, bl->bl_blocksize);
+	if (offset < new->bse_f_offset) {
+		if (is_range_written(be->be_inval, offset, new->bse_f_offset))
+			new->bse_f_offset = offset;
+		else
+			new->bse_f_offset = offset + bl->bl_blocksize;
+	}
+	new_end = normalize_up(end, bl->bl_blocksize);
+	if (end < new_end) {
+		if (is_range_written(be->be_inval, end, new_end))
+			end = new_end;
+		else
+			end = new_end - bl->bl_blocksize;
+	}
+	if (end <= new->bse_f_offset) {
+		kfree(new);
+		return 0;
+	}
+	new->bse_length = end - new->bse_f_offset;
+	new->bse_devid = be->be_devid;
+	new->bse_mdev = be->be_mdev;
+
+	spin_lock(&bl->bl_ext_lock);
+	/* new will be freed, either by add_to_commitlist if it decides not
+	 * to use it, or after LAYOUTCOMMIT uses it in the commitlist.
+	 */
+	add_to_commitlist(bl, new);
+	spin_unlock(&bl->bl_ext_lock);
+	return 0;
+}
+
 static void print_bl_extent(struct pnfs_block_extent *be)
 {
 	dprintk("PRINT EXTENT extent %p\n", be);
-- 
1.7.4.1



* [PATCH 32/34] pnfsblock: Implement release_inval_marks
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (30 preceding siblings ...)
  2011-06-12 23:45 ` [PATCH 31/34] pnfsblock: note written INVAL areas for layoutcommit Jim Rees
@ 2011-06-12 23:45 ` Jim Rees
  2011-06-12 23:45 ` [PATCH 33/34] Add configurable prefetch size for layoutget Jim Rees
  2011-06-12 23:45 ` [PATCH 34/34] NFS41: do not update isize if inode needs layoutcommit Jim Rees
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:45 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>

Leaving it unimplemented causes a memory leak.

Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
 fs/nfs/blocklayout/blocklayout.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 242c232..f201191 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -525,10 +525,15 @@ release_extents(struct pnfs_block_layout *bl,
         spin_unlock(&bl->bl_ext_lock);
 }
 
-/* STUB */
 static void
 release_inval_marks(struct pnfs_inval_markings *marks)
 {
+	struct pnfs_inval_tracking *pos, *temp;
+
+	list_for_each_entry_safe(pos, temp, &marks->im_tree.mtt_stub, it_link) {
+		list_del(&pos->it_link);
+		kfree(pos);
+	}
 	return;
 }
 
-- 
1.7.4.1



* [PATCH 33/34] Add configurable prefetch size for layoutget
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (31 preceding siblings ...)
  2011-06-12 23:45 ` [PATCH 32/34] pnfsblock: Implement release_inval_marks Jim Rees
@ 2011-06-12 23:45 ` Jim Rees
  2011-06-12 23:45 ` [PATCH 34/34] NFS41: do not update isize if inode needs layoutcommit Jim Rees
  33 siblings, 0 replies; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:45 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Peng Tao <peng_tao@emc.com>

pnfs_layout_prefetch_kb can be modified via sysctl.
It defaults to 0, so it has no effect unless set via sysctl.

Signed-off-by: Peng Tao <peng_tao@emc.com>
---
 fs/nfs/pnfs.c   |   17 +++++++++++++++++
 fs/nfs/pnfs.h   |    1 +
 fs/nfs/sysctl.c |   10 ++++++++++
 3 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 48a06a1..7b0c8dd 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -46,6 +46,11 @@ static DEFINE_SPINLOCK(pnfs_spinlock);
  */
 static LIST_HEAD(pnfs_modules_tbl);
 
+/*
+ * layoutget prefetch size
+ */
+unsigned int pnfs_layout_prefetch_kb = 0;
+
 /* Return the registered pnfs layout driver module matching given id */
 static struct pnfs_layoutdriver_type *
 find_pnfs_driver_locked(u32 id)
@@ -908,6 +913,16 @@ pnfs_find_lseg(struct pnfs_layout_hdr *lo,
 }
 
 /*
+ * Set layout prefetch length.
+ */
+static void
+pnfs_set_layout_prefetch(struct pnfs_layout_range *range)
+{
+	if (range->length < ((u64)pnfs_layout_prefetch_kb << 10))
+		range->length = (u64)pnfs_layout_prefetch_kb << 10;
+}
+
+/*
  * Layout segment is retreived from the server if not cached.
  * The appropriate layout segment is referenced and returned to the caller.
  */
@@ -958,6 +973,8 @@ pnfs_update_layout(struct inode *ino,
 
 	if (pnfs_layoutgets_blocked(lo, NULL, 0))
 		goto out_unlock;
+
+	pnfs_set_layout_prefetch(&arg);
 	atomic_inc(&lo->plh_outstanding);
 
 	get_layout_hdr(lo);
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 5048898..e12a77de 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -179,6 +179,7 @@ extern int nfs4_proc_layoutget(struct nfs4_layoutget *lgp);
 extern int nfs4_proc_layoutreturn(struct nfs4_layoutreturn *lrp);
 
 /* pnfs.c */
+extern unsigned int pnfs_layout_prefetch_kb;
 void get_layout_hdr(struct pnfs_layout_hdr *lo);
 void put_lseg(struct pnfs_layout_segment *lseg);
 struct pnfs_layout_segment *
diff --git a/fs/nfs/sysctl.c b/fs/nfs/sysctl.c
index 978aaeb..79a5134 100644
--- a/fs/nfs/sysctl.c
+++ b/fs/nfs/sysctl.c
@@ -14,6 +14,7 @@
 #include <linux/nfs_fs.h>
 
 #include "callback.h"
+#include "pnfs.h"
 
 #ifdef CONFIG_NFS_V4
 static const int nfs_set_port_min = 0;
@@ -42,6 +43,15 @@ static ctl_table nfs_cb_sysctls[] = {
 	},
 #endif /* CONFIG_NFS_USE_NEW_IDMAPPER */
 #endif
+#ifdef CONFIG_NFS_V4_1
+	{
+		.procname	= "pnfs_layout_prefetch_kb",
+		.data		= &pnfs_layout_prefetch_kb,
+		.maxlen		= sizeof(pnfs_layout_prefetch_kb),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif
 	{
 		.procname	= "nfs_mountpoint_timeout",
 		.data		= &nfs_mountpoint_expiry_timeout,
-- 
1.7.4.1



* [PATCH 34/34] NFS41: do not update isize if inode needs layoutcommit
  2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
                   ` (32 preceding siblings ...)
  2011-06-12 23:45 ` [PATCH 33/34] Add configurable prefetch size for layoutget Jim Rees
@ 2011-06-12 23:45 ` Jim Rees
  2011-06-14 16:15   ` Benny Halevy
  33 siblings, 1 reply; 58+ messages in thread
From: Jim Rees @ 2011-06-12 23:45 UTC (permalink / raw)
  To: linux-nfs; +Cc: peter honeyman

From: Peng Tao <bergwolf@gmail.com>

Layoutcommit is supposed to set the server file size, similar to
outstanding nfs pages. For the same reason, we should not update the client
file size while the inode needs layoutcommit.
Otherwise we will lose what we have at hand.

Signed-off-by: Peng Tao <peng_tao@emc.com>
---
 fs/nfs/inode.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 144f2a3..3f1eb81 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -1294,7 +1294,8 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
 		if (new_isize != cur_isize) {
 			/* Do we perhaps have any outstanding writes, or has
 			 * the file grown beyond our last write? */
-			if (nfsi->npages == 0 || new_isize > cur_isize) {
+			if ((nfsi->npages == 0 && !test_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) ||
+			     new_isize > cur_isize) {
 				i_size_write(inode, new_isize);
 				invalid |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_DATA;
 			}
-- 
1.7.4.1



* Re: [PATCH 03/34] pnfs: let layoutcommit code handle multiple segments
  2011-06-12 23:43 ` [PATCH 03/34] pnfs: let layoutcommit code handle multiple segments Jim Rees
@ 2011-06-13 14:36   ` Fred Isaman
  2011-06-14 10:40     ` tao.peng
  0 siblings, 1 reply; 58+ messages in thread
From: Fred Isaman @ 2011-06-13 14:36 UTC (permalink / raw)
  To: Jim Rees; +Cc: linux-nfs, peter honeyman

On Sun, Jun 12, 2011 at 7:43 PM, Jim Rees <rees@umich.edu> wrote:
> From: Peng Tao <bergwolf@gmail.com>
>
> Some layout driver like block will have multiple segments.
> Generic code should be able to handle it.
>
> Signed-off-by: Peng Tao <peng_tao@emc.com>
> ---
>  fs/nfs/pnfs.c |   17 +++++++++++++----
>  fs/nfs/pnfs.h |    1 +
>  2 files changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index e3d618b..f03a5e0 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -892,7 +892,7 @@ pnfs_find_lseg(struct pnfs_layout_hdr *lo,
>        dprintk("%s:Begin\n", __func__);
>
>        assert_spin_locked(&lo->plh_inode->i_lock);
> -       list_for_each_entry(lseg, &lo->plh_segs, pls_list) {
> +       list_for_each_entry_reverse(lseg, &lo->plh_segs, pls_list) {

This is a sorted list, and the order of search matters.  You can't
just reverse it here.

>                if (test_bit(NFS_LSEG_VALID, &lseg->pls_flags) &&
>                    is_matching_lseg(&lseg->pls_range, range)) {
>                        ret = get_lseg(lseg);
> @@ -1193,10 +1193,18 @@ pnfs_try_to_read_data(struct nfs_read_data *rdata,
>  static struct pnfs_layout_segment *pnfs_list_write_lseg(struct inode *inode)
>  {
>        struct pnfs_layout_segment *lseg, *rv = NULL;
> +       loff_t max_pos = 0;
> +
> +       list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list) {
> +               if (lseg->pls_range.iomode == IOMODE_RW) {
> +                       if (max_pos < lseg->pls_end_pos)
> +                               max_pos = lseg->pls_end_pos;
> +                       if (test_and_clear_bit(NFS_LSEG_LAYOUTCOMMIT, &lseg->pls_flags))
> +                               rv = lseg;
> +               }
> +       }
> +       rv->pls_end_pos = max_pos;
>

The idea here was that it could be extended to use segments by
returning a list of affected lsegs, not some random one.  Otherwise
you have problems with the fact that relevant but not returned lsegs
are going to get their refcounts messed up.

Fred

> -       list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list)
> -               if (lseg->pls_range.iomode == IOMODE_RW)
> -                       rv = lseg;
>        return rv;
>  }
>
> @@ -1211,6 +1219,7 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
>        if (!test_and_set_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) {
>                /* references matched in nfs4_layoutcommit_release */
>                get_lseg(wdata->lseg);
> +               set_bit(NFS_LSEG_LAYOUTCOMMIT, &wdata->lseg->pls_flags);
>                wdata->lseg->pls_lc_cred =
>                        get_rpccred(wdata->args.context->state->owner->so_cred);
>                mark_as_dirty = true;
> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> index b071b56..a3fc0f2 100644
> --- a/fs/nfs/pnfs.h
> +++ b/fs/nfs/pnfs.h
> @@ -36,6 +36,7 @@
>  enum {
>        NFS_LSEG_VALID = 0,     /* cleared when lseg is recalled/returned */
>        NFS_LSEG_ROC,           /* roc bit received from server */
> +       NFS_LSEG_LAYOUTCOMMIT,  /* layoutcommit bit set for layoutcommit */
>  };
>
>  struct pnfs_layout_segment {
> --
> 1.7.4.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 04/34] pnfs: hook nfs_write_begin/end to allow layout driver manipulation
  2011-06-12 23:43 ` [PATCH 04/34] pnfs: hook nfs_write_begin/end to allow layout driver manipulation Jim Rees
@ 2011-06-13 14:44   ` Fred Isaman
  2011-06-14 11:01     ` tao.peng
  0 siblings, 1 reply; 58+ messages in thread
From: Fred Isaman @ 2011-06-13 14:44 UTC (permalink / raw)
  To: Jim Rees; +Cc: linux-nfs, peter honeyman

On Sun, Jun 12, 2011 at 7:43 PM, Jim Rees <rees@umich.edu> wrote:
> From: Peng Tao <bergwolf@gmail.com>
>
> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
> Reported-by: Alexandros Batsakis <batsakis@netapp.com>
> Signed-off-by: Andy Adamson <andros@netapp.com>
> Signed-off-by: Fred Isaman <iisaman@netapp.com>
> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
> Signed-off-by: Peng Tao <bergwolf@gmail.com>
> ---
>  fs/nfs/file.c          |   26 ++++++++++-
>  fs/nfs/pnfs.c          |   41 +++++++++++++++++
>  fs/nfs/pnfs.h          |  115 ++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/nfs/write.c         |   12 +++--
>  include/linux/nfs_fs.h |    3 +-
>  5 files changed, 189 insertions(+), 8 deletions(-)
>
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 2f093ed..1768762 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -384,12 +384,15 @@ static int nfs_write_begin(struct file *file, struct address_space *mapping,
>        pgoff_t index = pos >> PAGE_CACHE_SHIFT;
>        struct page *page;
>        int once_thru = 0;
> +       struct pnfs_layout_segment *lseg;
>
>        dfprintk(PAGECACHE, "NFS: write_begin(%s/%s(%ld), %u@%lld)\n",
>                file->f_path.dentry->d_parent->d_name.name,
>                file->f_path.dentry->d_name.name,
>                mapping->host->i_ino, len, (long long) pos);
> -
> +       lseg = pnfs_update_layout(mapping->host,
> +                                 nfs_file_open_context(file),
> +                                 pos, len, IOMODE_RW, GFP_NOFS);


This looks like it is left over from before the rearrangements done to
where pnfs_update_layout is called.
In particular, we don't want to hold the reference on the lseg from
here until flush time, and there seems to be no reason to.  If the
client needs a layout to deal with read-in here, it should instead
trigger the nfs_want_read_modify_write clause.

Fred

>  start:
>        /*
>         * Prevent starvation issues if someone is doing a consistency
> @@ -409,6 +412,9 @@ start:
>        if (ret) {
>                unlock_page(page);
>                page_cache_release(page);
> +               *pagep = NULL;
> +               *fsdata = NULL;
> +               goto out;
>        } else if (!once_thru &&
>                   nfs_want_read_modify_write(file, page, pos, len)) {
>                once_thru = 1;
> @@ -417,6 +423,12 @@ start:
>                if (!ret)
>                        goto start;
>        }
> +       ret = pnfs_write_begin(file, page, pos, len, lseg, fsdata);
> + out:
> +       if (ret) {
> +               put_lseg(lseg);
> +               *fsdata = NULL;
> +       }
>        return ret;
>  }
>
> @@ -426,6 +438,7 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
>  {
>        unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
>        int status;
> +       struct pnfs_layout_segment *lseg;
>
>        dfprintk(PAGECACHE, "NFS: write_end(%s/%s(%ld), %u@%lld)\n",
>                file->f_path.dentry->d_parent->d_name.name,
> @@ -452,10 +465,17 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
>                        zero_user_segment(page, pglen, PAGE_CACHE_SIZE);
>        }
>
> -       status = nfs_updatepage(file, page, offset, copied);
> +       lseg = nfs4_pull_lseg_from_fsdata(file, fsdata);
> +       status = pnfs_write_end(file, page, pos, len, copied, lseg);
> +       if (status)
> +               goto out;
> +       status = nfs_updatepage(file, page, offset, copied, lseg, fsdata);
>
> +out:
>        unlock_page(page);
>        page_cache_release(page);
> +       pnfs_write_end_cleanup(file, fsdata);
> +       put_lseg(lseg);
>
>        if (status < 0)
>                return status;
> @@ -577,7 +597,7 @@ static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>
>        ret = VM_FAULT_LOCKED;
>        if (nfs_flush_incompatible(filp, page) == 0 &&
> -           nfs_updatepage(filp, page, 0, pagelen) == 0)
> +           nfs_updatepage(filp, page, 0, pagelen, NULL, NULL) == 0)
>                goto out;
>
>        ret = VM_FAULT_SIGBUS;
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index f03a5e0..e693718 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -1138,6 +1138,41 @@ pnfs_try_to_write_data(struct nfs_write_data *wdata,
>  }
>
>  /*
> + * This gives the layout driver an opportunity to read in page "around"
> + * the data to be written.  It returns 0 on success, otherwise an error code
> + * which will either be passed up to user, or ignored if
> + * some previous part of write succeeded.
> + * Note the range [pos, pos+len-1] is entirely within the page.
> + */
> +int _pnfs_write_begin(struct inode *inode, struct page *page,
> +                     loff_t pos, unsigned len,
> +                     struct pnfs_layout_segment *lseg,
> +                     struct pnfs_fsdata **fsdata)
> +{
> +       struct pnfs_fsdata *data;
> +       int status = 0;
> +
> +       dprintk("--> %s: pos=%llu len=%u\n",
> +               __func__, (unsigned long long)pos, len);
> +       data = kzalloc(sizeof(struct pnfs_fsdata), GFP_KERNEL);
> +       if (!data) {
> +               status = -ENOMEM;
> +               goto out;
> +       }
> +       data->lseg = lseg; /* refcount passed into data to be managed there */
> +       status = NFS_SERVER(inode)->pnfs_curr_ld->write_begin(
> +                                               lseg, page, pos, len, data);
> +       if (status) {
> +               kfree(data);
> +               data = NULL;
> +       }
> +out:
> +       *fsdata = data;
> +       dprintk("<-- %s: status=%d\n", __func__, status);
> +       return status;
> +}
> +
> +/*
>  * Called by non rpc-based layout drivers
>  */
>  int
> @@ -1237,6 +1272,12 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
>  }
>  EXPORT_SYMBOL_GPL(pnfs_set_layoutcommit);
>
> +void pnfs_free_fsdata(struct pnfs_fsdata *fsdata)
> +{
> +       /* lseg refcounting handled directly in nfs_write_end */
> +       kfree(fsdata);
> +}
> +
>  /*
>  * For the LAYOUT4_NFSV4_1_FILES layout type, NFS_DATA_SYNC WRITEs and
>  * NFS_UNSTABLE WRITEs with a COMMIT to data servers must store enough
> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> index a3fc0f2..525ec55 100644
> --- a/fs/nfs/pnfs.h
> +++ b/fs/nfs/pnfs.h
> @@ -54,6 +54,12 @@ enum pnfs_try_status {
>        PNFS_NOT_ATTEMPTED = 1,
>  };
>
> +struct pnfs_fsdata {
> +       struct pnfs_layout_segment *lseg;
> +       int bypass_eof;
> +       void *private;
> +};
> +
>  #ifdef CONFIG_NFS_V4_1
>
>  #define LAYOUT_NFSV4_1_MODULE_PREFIX "nfs-layouttype4"
> @@ -106,6 +112,14 @@ struct pnfs_layoutdriver_type {
>         */
>        enum pnfs_try_status (*read_pagelist) (struct nfs_read_data *nfs_data);
>        enum pnfs_try_status (*write_pagelist) (struct nfs_write_data *nfs_data, int how);
> +       int (*write_begin) (struct pnfs_layout_segment *lseg, struct page *page,
> +                           loff_t pos, unsigned count,
> +                           struct pnfs_fsdata *fsdata);
> +       int (*write_end)(struct inode *inode, struct page *page, loff_t pos,
> +                        unsigned count, unsigned copied,
> +                        struct pnfs_layout_segment *lseg);
> +       void (*write_end_cleanup)(struct file *filp,
> +                                 struct pnfs_fsdata *fsdata);
>
>        void (*free_deviceid_node) (struct nfs4_deviceid_node *);
>
> @@ -175,6 +189,7 @@ enum pnfs_try_status pnfs_try_to_write_data(struct nfs_write_data *,
>  enum pnfs_try_status pnfs_try_to_read_data(struct nfs_read_data *,
>                                            const struct rpc_call_ops *);
>  bool pnfs_generic_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev, struct nfs_page *req);
> +void pnfs_free_fsdata(struct pnfs_fsdata *fsdata);
>  int pnfs_layout_process(struct nfs4_layoutget *lgp);
>  void pnfs_free_lseg_list(struct list_head *tmp_list);
>  void pnfs_destroy_layout(struct nfs_inode *);
> @@ -186,6 +201,10 @@ void pnfs_set_layout_stateid(struct pnfs_layout_hdr *lo,
>  int pnfs_choose_layoutget_stateid(nfs4_stateid *dst,
>                                  struct pnfs_layout_hdr *lo,
>                                  struct nfs4_state *open_state);
> +int _pnfs_write_begin(struct inode *inode, struct page *page,
> +                     loff_t pos, unsigned len,
> +                     struct pnfs_layout_segment *lseg,
> +                     struct pnfs_fsdata **fsdata);
>  int mark_matching_lsegs_invalid(struct pnfs_layout_hdr *lo,
>                                struct list_head *tmp_list,
>                                struct pnfs_layout_range *recall_range);
> @@ -287,6 +306,13 @@ static inline void pnfs_clear_request_commit(struct nfs_page *req)
>                put_lseg(req->wb_commit_lseg);
>  }
>
> +static inline int pnfs_grow_ok(struct pnfs_layout_segment *lseg,
> +                              struct pnfs_fsdata *fsdata)
> +{
> +       return !fsdata  || ((struct pnfs_layout_segment *)fsdata == lseg) ||
> +               !fsdata->bypass_eof;
> +}
> +
>  /* Should the pNFS client commit and return the layout upon a setattr */
>  static inline bool
>  pnfs_ld_layoutret_on_setattr(struct inode *inode)
> @@ -297,6 +323,49 @@ pnfs_ld_layoutret_on_setattr(struct inode *inode)
>                PNFS_LAYOUTRET_ON_SETATTR;
>  }
>
> +static inline int pnfs_write_begin(struct file *filp, struct page *page,
> +                                  loff_t pos, unsigned len,
> +                                  struct pnfs_layout_segment *lseg,
> +                                  void **fsdata)
> +{
> +       struct inode *inode = filp->f_dentry->d_inode;
> +       struct nfs_server *nfss = NFS_SERVER(inode);
> +       int status = 0;
> +
> +       *fsdata = lseg;
> +       if (lseg && nfss->pnfs_curr_ld->write_begin)
> +               status = _pnfs_write_begin(inode, page, pos, len, lseg,
> +                                          (struct pnfs_fsdata **) fsdata);
> +       return status;
> +}
> +
> +/* CAREFUL - what happens if copied < len??? */
> +static inline int pnfs_write_end(struct file *filp, struct page *page,
> +                                loff_t pos, unsigned len, unsigned copied,
> +                                struct pnfs_layout_segment *lseg)
> +{
> +       struct inode *inode = filp->f_dentry->d_inode;
> +       struct nfs_server *nfss = NFS_SERVER(inode);
> +
> +       if (nfss->pnfs_curr_ld && nfss->pnfs_curr_ld->write_end)
> +               return nfss->pnfs_curr_ld->write_end(inode, page, pos, len,
> +                                                    copied, lseg);
> +       else
> +               return 0;
> +}
> +
> +static inline void pnfs_write_end_cleanup(struct file *filp, void *fsdata)
> +{
> +       struct nfs_server *nfss = NFS_SERVER(filp->f_dentry->d_inode);
> +
> +       if (fsdata && nfss->pnfs_curr_ld) {
> +               if (nfss->pnfs_curr_ld->write_end_cleanup)
> +                       nfss->pnfs_curr_ld->write_end_cleanup(filp, fsdata);
> +               if (nfss->pnfs_curr_ld->write_begin)
> +                       pnfs_free_fsdata(fsdata);
> +       }
> +}
> +
>  static inline int pnfs_return_layout(struct inode *ino)
>  {
>        struct nfs_inode *nfsi = NFS_I(ino);
> @@ -317,6 +386,19 @@ static inline void pnfs_pageio_init(struct nfs_pageio_descriptor *pgio,
>                pgio->pg_test = ld->pg_test;
>  }
>
> +static inline struct pnfs_layout_segment *
> +nfs4_pull_lseg_from_fsdata(struct file *filp, void *fsdata)
> +{
> +       if (fsdata) {
> +               struct nfs_server *nfss = NFS_SERVER(filp->f_dentry->d_inode);
> +
> +               if (nfss->pnfs_curr_ld && nfss->pnfs_curr_ld->write_begin)
> +                       return ((struct pnfs_fsdata *) fsdata)->lseg;
> +               return (struct pnfs_layout_segment *)fsdata;
> +       }
> +       return NULL;
> +}
> +
>  #else  /* CONFIG_NFS_V4_1 */
>
>  static inline void pnfs_destroy_all_layouts(struct nfs_client *clp)
> @@ -345,6 +427,12 @@ pnfs_update_layout(struct inode *ino, struct nfs_open_context *ctx,
>        return NULL;
>  }
>
> +static inline int pnfs_grow_ok(struct pnfs_layout_segment *lseg,
> +                              struct pnfs_fsdata *fsdata)
> +{
> +       return 1;
> +}
> +
>  static inline enum pnfs_try_status
>  pnfs_try_to_read_data(struct nfs_read_data *data,
>                      const struct rpc_call_ops *call_ops)
> @@ -364,6 +452,26 @@ static inline int pnfs_return_layout(struct inode *ino)
>        return 0;
>  }
>
> +static inline int pnfs_write_begin(struct file *filp, struct page *page,
> +                                  loff_t pos, unsigned len,
> +                                  struct pnfs_layout_segment *lseg,
> +                                  void **fsdata)
> +{
> +       *fsdata = NULL;
> +       return 0;
> +}
> +
> +static inline int pnfs_write_end(struct file *filp, struct page *page,
> +                                loff_t pos, unsigned len, unsigned copied,
> +                                struct pnfs_layout_segment *lseg)
> +{
> +       return 0;
> +}
> +
> +static inline void pnfs_write_end_cleanup(struct file *filp, void *fsdata)
> +{
> +}
> +
>  static inline bool
>  pnfs_ld_layoutret_on_setattr(struct inode *inode)
>  {
> @@ -435,6 +543,13 @@ static inline int pnfs_layoutcommit_inode(struct inode *inode, bool sync)
>  static inline void nfs4_deviceid_purge_client(struct nfs_client *ncl)
>  {
>  }
> +
> +static inline struct pnfs_layout_segment *
> +nfs4_pull_lseg_from_fsdata(struct file *filp, void *fsdata)
> +{
> +       return NULL;
> +}
> +
>  #endif /* CONFIG_NFS_V4_1 */
>
>  #endif /* FS_NFS_PNFS_H */
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index e268e3b..75e2a6b 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -673,7 +673,9 @@ out:
>  }
>
>  static int nfs_writepage_setup(struct nfs_open_context *ctx, struct page *page,
> -               unsigned int offset, unsigned int count)
> +               unsigned int offset, unsigned int count,
> +               struct pnfs_layout_segment *lseg, void *fsdata)
> +
>  {
>        struct nfs_page *req;
>
> @@ -681,7 +683,8 @@ static int nfs_writepage_setup(struct nfs_open_context *ctx, struct page *page,
>        if (IS_ERR(req))
>                return PTR_ERR(req);
>        /* Update file length */
> -       nfs_grow_file(page, offset, count);
> +       if (pnfs_grow_ok(lseg, fsdata))
> +               nfs_grow_file(page, offset, count);
>        nfs_mark_uptodate(page, req->wb_pgbase, req->wb_bytes);
>        nfs_mark_request_dirty(req);
>        nfs_clear_page_tag_locked(req);
> @@ -734,7 +737,8 @@ static int nfs_write_pageuptodate(struct page *page, struct inode *inode)
>  * things with a page scheduled for an RPC call (e.g. invalidate it).
>  */
>  int nfs_updatepage(struct file *file, struct page *page,
> -               unsigned int offset, unsigned int count)
> +               unsigned int offset, unsigned int count,
> +               struct pnfs_layout_segment *lseg, void *fsdata)
>  {
>        struct nfs_open_context *ctx = nfs_file_open_context(file);
>        struct inode    *inode = page->mapping->host;
> @@ -759,7 +763,7 @@ int nfs_updatepage(struct file *file, struct page *page,
>                offset = 0;
>        }
>
> -       status = nfs_writepage_setup(ctx, page, offset, count);
> +       status = nfs_writepage_setup(ctx, page, offset, count, lseg, fsdata);
>        if (status < 0)
>                nfs_set_pageerror(page);
>
> diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> index 1b93b9c..be1ac1d 100644
> --- a/include/linux/nfs_fs.h
> +++ b/include/linux/nfs_fs.h
> @@ -510,7 +510,8 @@ extern int  nfs_congestion_kb;
>  extern int  nfs_writepage(struct page *page, struct writeback_control *wbc);
>  extern int  nfs_writepages(struct address_space *, struct writeback_control *);
>  extern int  nfs_flush_incompatible(struct file *file, struct page *page);
> -extern int  nfs_updatepage(struct file *, struct page *, unsigned int, unsigned int);
> +extern int  nfs_updatepage(struct file *, struct page *, unsigned int, unsigned int,
> +                       struct pnfs_layout_segment *, void *);
>  extern void nfs_writeback_done(struct rpc_task *, struct nfs_write_data *);
>
>  /*
> --
> 1.7.4.1
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 18/34] pnfsblock: allow use of PG_owner_priv_1 flag
  2011-06-12 23:44 ` [PATCH 18/34] pnfsblock: allow use of PG_owner_priv_1 flag Jim Rees
@ 2011-06-13 15:56   ` Fred Isaman
  0 siblings, 0 replies; 58+ messages in thread
From: Fred Isaman @ 2011-06-13 15:56 UTC (permalink / raw)
  To: Jim Rees; +Cc: linux-nfs, peter honeyman

On Sun, Jun 12, 2011 at 7:44 PM, Jim Rees <rees@umich.edu> wrote:
> From: Fred Isaman <iisaman@citi.umich.edu>
>
> There is currently no good way for pnfs to communicate problems.  For
> example - the linux read code first tries to do readahead through
> nfs_readpages. Failure there is ignored, and it will later call
> nfs_readpage.  Failure there is also ignored, except that the lack of
> PG_uptodate is communicated back via -EIO.
>
> With pnfs, it would be useful to be able to communicate to
> nfs_readpage that direct disk IO failed on readahead, and that it
> should failover to using the MDS.
>
> Making the page flag PG_owner_priv_1 available as PG_pnfserr is one
> way to do so. (An alternative would be to embed this in the layout,
> but then pg_test can't easily access the info.)
>
> This may be better as generic pnfs code, in which case it should be
> put in pnfs.h, or even page-flags.h
>
> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
> ---

The error handling has changed a bit since this was written.  It is
currently very coarse-grained for simplicity reasons.
Is there any reason you can't do similarly, and just use
NFS_LAYOUT_{RW|RO}_FAILED in lo->plh_flags?

Fred
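
A rough sketch of the coarse-grained alternative, assuming one failure bit
per iomode kept in the layout header instead of a per-page flag.  All
sketch_* names are hypothetical; the two bits are only modelled on
NFS_LAYOUT_RO_FAILED and NFS_LAYOUT_RW_FAILED, and the plain bit
operations here are not atomic the way the kernel's set_bit/test_bit are.

```c
#include <stdbool.h>

enum sketch_iomode { SKETCH_IOMODE_READ, SKETCH_IOMODE_RW };

/* Bit numbers modelled on NFS_LAYOUT_RO_FAILED / NFS_LAYOUT_RW_FAILED. */
enum { SKETCH_RO_FAILED = 0, SKETCH_RW_FAILED = 1 };

/* Stand-in for the flags word in struct pnfs_layout_hdr. */
struct sketch_layout_hdr {
	unsigned long flags;
};

static int sketch_fail_bit(enum sketch_iomode mode)
{
	return mode == SKETCH_IOMODE_RW ? SKETCH_RW_FAILED : SKETCH_RO_FAILED;
}

/* Record that direct I/O in this mode failed for the whole layout. */
static void sketch_layout_mark_failed(struct sketch_layout_hdr *lo,
				      enum sketch_iomode mode)
{
	lo->flags |= 1UL << sketch_fail_bit(mode);
}

/* True when I/O in this mode should bypass the layout and use the MDS. */
static bool sketch_layout_use_mds(const struct sketch_layout_hdr *lo,
				  enum sketch_iomode mode)
{
	return (lo->flags >> sketch_fail_bit(mode)) & 1UL;
}
```

The trade-off is granularity: one failed readahead sends all later reads
on that layout to the MDS, but no page flag needs to be borrowed.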

>  fs/nfs/blocklayout/blocklayout.h |    5 +++++
>  1 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
> index 21fa21c..293f009 100644
> --- a/fs/nfs/blocklayout/blocklayout.h
> +++ b/fs/nfs/blocklayout/blocklayout.h
> @@ -35,6 +35,11 @@
>  #include <linux/nfs_fs.h>
>  #include "../pnfs.h"
>
> +#define PG_pnfserr PG_owner_priv_1
> +#define PagePnfsErr(page)      test_bit(PG_pnfserr, &(page)->flags)
> +#define SetPagePnfsErr(page)   set_bit(PG_pnfserr, &(page)->flags)
> +#define ClearPagePnfsErr(page) clear_bit(PG_pnfserr, &(page)->flags)
> +
>  struct block_mount_id {
>        spinlock_t                      bm_lock;    /* protects list */
>        struct list_head                bm_devlist; /* holds pnfs_block_dev */
> --
> 1.7.4.1
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 06/34] pnfs: cleanup_layoutcommit
  2011-06-12 23:44 ` [PATCH 06/34] pnfs: cleanup_layoutcommit Jim Rees
@ 2011-06-13 21:19   ` Benny Halevy
  2011-06-14 15:16     ` Peng Tao
  2011-06-14 15:10   ` Benny Halevy
  2011-06-14 15:19   ` Benny Halevy
  2 siblings, 1 reply; 58+ messages in thread
From: Benny Halevy @ 2011-06-13 21:19 UTC (permalink / raw)
  To: Jim Rees; +Cc: linux-nfs, peter honeyman

On 2011-06-12 19:44, Jim Rees wrote:
> From: Peng Tao <bergwolf@gmail.com>
> 
> This gives layout driver a chance to cleanup structures they put in.
> Also ensure layoutcommit does not commit more than isize, as block layout
> driver may dirty pages beyond EOF.
> 
> Signed-off-by: Andy Adamson <andros@netapp.com>
> [fixup layout header pointer for layoutcommit]
> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
> Signed-off-by: Peng Tao <bergwolf@gmail.com>
> ---
>  fs/nfs/nfs4proc.c       |    1 +
>  fs/nfs/nfs4xdr.c        |    3 ++-
>  fs/nfs/pnfs.c           |   15 +++++++++++++++
>  fs/nfs/pnfs.h           |    4 ++++
>  include/linux/nfs_xdr.h |    1 +
>  5 files changed, 23 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index 5246db8..e27a648 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -5890,6 +5890,7 @@ static void nfs4_layoutcommit_release(void *calldata)
>  {
>  	struct nfs4_layoutcommit_data *data = calldata;
>  
> +	pnfs_cleanup_layoutcommit(data->args.inode, data);

The layout driver better be passed the status on the done method
rather than on release so that it can roll back on error.

Although it is quite complicated to roll back after permanent errors like
NFS4ERR_BADLAYOUT where the client is really screwed and it
essentially needs to redirty and rewrite the data (to the MDS
to simplify the error handling path), rolling back from
transient errors like NFS4ERR_DELAY should be fairly easy.

Benny
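
To illustrate the suggestion, here is a sketch of a driver hook that
receives the LAYOUTCOMMIT status from the done path and decides whether to
roll back.  The hook, the state structure and the action enum are all
hypothetical; only the two error numbers are taken from RFC 5661.

```c
enum {
	SKETCH_NFS4_OK = 0,
	SKETCH_NFS4ERR_DELAY = 10008,		/* transient, per RFC 5661 */
	SKETCH_NFS4ERR_BADLAYOUT = 10050,	/* permanent, per RFC 5661 */
};

enum sketch_commit_action {
	SKETCH_COMMIT_DONE,		/* extents are committed */
	SKETCH_COMMIT_RETRY,		/* rolled back, resend later */
	SKETCH_COMMIT_REWRITE_VIA_MDS,	/* redirty and write through MDS */
};

/* Hypothetical per-layout bookkeeping kept by the layout driver. */
struct sketch_commit_state {
	int in_flight;	/* extents carried by the outstanding LAYOUTCOMMIT */
	int pending;	/* extents still waiting for a LAYOUTCOMMIT */
};

/* Called from the RPC done path, where the operation status is known. */
static enum sketch_commit_action
sketch_layoutcommit_done(struct sketch_commit_state *st, int status)
{
	switch (status) {
	case SKETCH_NFS4_OK:
		st->in_flight = 0;
		return SKETCH_COMMIT_DONE;
	case SKETCH_NFS4ERR_DELAY:
		/* Transient: move the extents back so a retry resends them. */
		st->pending += st->in_flight;
		st->in_flight = 0;
		return SKETCH_COMMIT_RETRY;
	default:
		/* Permanent: the layout cannot be trusted any more. */
		st->in_flight = 0;
		return SKETCH_COMMIT_REWRITE_VIA_MDS;
	}
}
```

A release-time hook cannot make this distinction unless the status is
stashed somewhere first, which is what the res->status field in this
patch does.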

>  	/* Matched by references in pnfs_set_layoutcommit */
>  	put_lseg(data->lseg);
>  	put_rpccred(data->cred);
> diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
> index fdcbd8f..57295d1 100644
> --- a/fs/nfs/nfs4xdr.c
> +++ b/fs/nfs/nfs4xdr.c
> @@ -1963,7 +1963,7 @@ encode_layoutcommit(struct xdr_stream *xdr,
>  	*p++ = cpu_to_be32(OP_LAYOUTCOMMIT);
>  	/* Only whole file layouts */
>  	p = xdr_encode_hyper(p, 0); /* offset */
> -	p = xdr_encode_hyper(p, NFS4_MAX_UINT64); /* length */
> +	p = xdr_encode_hyper(p, args->lastbytewritten+1); /* length */
>  	*p++ = cpu_to_be32(0); /* reclaim */
>  	p = xdr_encode_opaque_fixed(p, args->stateid.data, NFS4_STATEID_SIZE);
>  	*p++ = cpu_to_be32(1); /* newoffset = TRUE */
> @@ -5467,6 +5467,7 @@ static int decode_layoutcommit(struct xdr_stream *xdr,
>  	int status;
>  
>  	status = decode_op_hdr(xdr, OP_LAYOUTCOMMIT);
> +	res->status = status;
>  	if (status)
>  		return status;
>  
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index e693718..48a06a1 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -1248,6 +1248,7 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
>  {
>  	struct nfs_inode *nfsi = NFS_I(wdata->inode);
>  	loff_t end_pos = wdata->mds_offset + wdata->res.count;
> +	loff_t isize = i_size_read(wdata->inode);
>  	bool mark_as_dirty = false;
>  
>  	spin_lock(&nfsi->vfs_inode.i_lock);
> @@ -1261,9 +1262,13 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
>  		dprintk("%s: Set layoutcommit for inode %lu ",
>  			__func__, wdata->inode->i_ino);
>  	}
> +	if (end_pos > isize)
> +		end_pos = isize;
>  	if (end_pos > wdata->lseg->pls_end_pos)
>  		wdata->lseg->pls_end_pos = end_pos;
>  	spin_unlock(&nfsi->vfs_inode.i_lock);
> +	dprintk("%s: lseg %p end_pos %llu\n",
> +		__func__, wdata->lseg, wdata->lseg->pls_end_pos);
>  
>  	/* if pnfs_layoutcommit_inode() runs between inode locks, the next one
>  	 * will be a noop because NFS_INO_LAYOUTCOMMIT will not be set */
> @@ -1272,6 +1277,16 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
>  }
>  EXPORT_SYMBOL_GPL(pnfs_set_layoutcommit);
>  
> +void pnfs_cleanup_layoutcommit(struct inode *inode,
> +                               struct nfs4_layoutcommit_data *data)
> +{
> +        struct nfs_server *nfss = NFS_SERVER(inode);
> +
> +        if (nfss->pnfs_curr_ld->cleanup_layoutcommit)
> +                nfss->pnfs_curr_ld->cleanup_layoutcommit(
> +                                        NFS_I(inode)->layout, data);
> +}
> +
>  void pnfs_free_fsdata(struct pnfs_fsdata *fsdata)
>  {
>  	/* lseg refcounting handled directly in nfs_write_end */
> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> index 525ec55..5048898 100644
> --- a/fs/nfs/pnfs.h
> +++ b/fs/nfs/pnfs.h
> @@ -127,6 +127,9 @@ struct pnfs_layoutdriver_type {
>  				     struct xdr_stream *xdr,
>  				     const struct nfs4_layoutreturn_args *args);
>  
> +        void (*cleanup_layoutcommit) (struct pnfs_layout_hdr *layoutid,
> +                                      struct nfs4_layoutcommit_data *data);
> +
>  	void (*encode_layoutcommit) (struct pnfs_layout_hdr *layoutid,
>  				     struct xdr_stream *xdr,
>  				     const struct nfs4_layoutcommit_args *args);
> @@ -213,6 +216,7 @@ void pnfs_roc_release(struct inode *ino);
>  void pnfs_roc_set_barrier(struct inode *ino, u32 barrier);
>  bool pnfs_roc_drain(struct inode *ino, u32 *barrier);
>  void pnfs_set_layoutcommit(struct nfs_write_data *wdata);
> +void pnfs_cleanup_layoutcommit(struct inode *inode, struct nfs4_layoutcommit_data *data);
>  int pnfs_layoutcommit_inode(struct inode *inode, bool sync);
>  int _pnfs_return_layout(struct inode *);
>  int pnfs_ld_write_done(struct nfs_write_data *);
> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
> index a9c43ba..2c3ffda 100644
> --- a/include/linux/nfs_xdr.h
> +++ b/include/linux/nfs_xdr.h
> @@ -270,6 +270,7 @@ struct nfs4_layoutcommit_res {
>  	struct nfs_fattr *fattr;
>  	const struct nfs_server *server;
>  	struct nfs4_sequence_res seq_res;
> +	int status;
>  };
>  
>  struct nfs4_layoutcommit_data {

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 03/34] pnfs: let layoutcommit code handle multiple segments
  2011-06-13 14:36   ` Fred Isaman
@ 2011-06-14 10:40     ` tao.peng
  2011-06-14 13:58       ` Fred Isaman
  2011-06-14 14:28       ` Benny Halevy
  0 siblings, 2 replies; 58+ messages in thread
From: tao.peng @ 2011-06-14 10:40 UTC (permalink / raw)
  To: iisaman, rees; +Cc: linux-nfs, honey

Hi, Fred,

> -----Original Message-----
> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-owner@vger.kernel.org]
> On Behalf Of Fred Isaman
> Sent: Monday, June 13, 2011 10:37 PM
> To: Jim Rees
> Cc: linux-nfs@vger.kernel.org; peter honeyman
> Subject: Re: [PATCH 03/34] pnfs: let layoutcommit code handle multiple segments
> 
> On Sun, Jun 12, 2011 at 7:43 PM, Jim Rees <rees@umich.edu> wrote:
> > From: Peng Tao <bergwolf@gmail.com>
> >
> > Some layout driver like block will have multiple segments.
> > Generic code should be able to handle it.
> >
> > Signed-off-by: Peng Tao <peng_tao@emc.com>
> > ---
> >  fs/nfs/pnfs.c |   17 +++++++++++++----
> >  fs/nfs/pnfs.h |    1 +
> >  2 files changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> > index e3d618b..f03a5e0 100644
> > --- a/fs/nfs/pnfs.c
> > +++ b/fs/nfs/pnfs.c
> > @@ -892,7 +892,7 @@ pnfs_find_lseg(struct pnfs_layout_hdr *lo,
> >        dprintk("%s:Begin\n", __func__);
> >
> >        assert_spin_locked(&lo->plh_inode->i_lock);
> > -       list_for_each_entry(lseg, &lo->plh_segs, pls_list) {
> > +       list_for_each_entry_reverse(lseg, &lo->plh_segs, pls_list) {
> 
> This is a sorted list, and the order of search matters.  You can't
> just reverse it here.
The layout segment list is sorted by increasing offset, but the lookup code here assumes decreasing order.
To fix it, we should either walk the list in reverse or change the break condition test. Otherwise the lookup always fails unless the first segment matches.
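
A simplified model of the problem, assuming segments are kept in an array
sorted by increasing offset.  The struct and function names are
hypothetical, and the early-exit tests are a stand-in for the comparison
in pnfs_find_lseg, not a copy of it.

```c
#include <stddef.h>

struct sketch_seg {
	unsigned long long offset;
	unsigned long long length;
};

/* True when [off, off + len) lies entirely inside the segment. */
static int sketch_seg_covers(const struct sketch_seg *s,
			     unsigned long long off, unsigned long long len)
{
	return off >= s->offset && off + len <= s->offset + s->length;
}

/*
 * Early exit written for increasing order: once a segment starts past
 * the requested offset, no later segment can cover it.
 */
static const struct sketch_seg *
sketch_find_increasing(const struct sketch_seg *segs, size_t n,
		       unsigned long long off, unsigned long long len)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (sketch_seg_covers(&segs[i], off, len))
			return &segs[i];
		if (segs[i].offset > off)
			break;
	}
	return NULL;
}

/*
 * Early exit written for decreasing order.  On an increasing list it
 * gives up at the first segment that starts below the requested offset,
 * so only a match on the first entry is ever found.
 */
static const struct sketch_seg *
sketch_find_mismatched(const struct sketch_seg *segs, size_t n,
		       unsigned long long off, unsigned long long len)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (sketch_seg_covers(&segs[i], off, len))
			return &segs[i];
		if (segs[i].offset < off)
			break;
	}
	return NULL;
}
```

With segments at offsets 0, 100 and 200 (each 100 bytes long), a lookup
for offset 250 succeeds in the first function and fails in the second.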

> 
> >                if (test_bit(NFS_LSEG_VALID, &lseg->pls_flags) &&
> >                    is_matching_lseg(&lseg->pls_range, range)) {
> >                        ret = get_lseg(lseg);
> > @@ -1193,10 +1193,18 @@ pnfs_try_to_read_data(struct nfs_read_data *rdata,
> >  static struct pnfs_layout_segment *pnfs_list_write_lseg(struct inode *inode)
> >  {
> >        struct pnfs_layout_segment *lseg, *rv = NULL;
> > +       loff_t max_pos = 0;
> > +
> > +       list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list) {
> > +               if (lseg->pls_range.iomode == IOMODE_RW) {
> > +                       if (max_pos < lseg->pls_end_pos)
> > +                               max_pos = lseg->pls_end_pos;
> > +                       if (test_and_clear_bit(NFS_LSEG_LAYOUTCOMMIT, &lseg->pls_flags))
> > +                               rv = lseg;
> > +               }
> > +       }
> > +       rv->pls_end_pos = max_pos;
> >
> 
> The idea here was that it could be extended to use segments by
> returning a list of affected lsegs, not some random one.  Otherwise
> you have problems, because relevant but not returned lsegs are going
> to get their refcounts messed up.
The above code relies on the NFS_INO_LAYOUTCOMMIT bit to ensure that only one lseg per inode has NFS_LSEG_LAYOUTCOMMIT set. But you are right: the layoutcommit code needs a second look.
How about making it return a list of affected lsegs and passing that list around the layoutcommit_procs?
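
One possible shape for the "list of affected lsegs" idea, as a user-space sketch only. The struct, the needs_commit field (standing in for an NFS_LSEG_LAYOUTCOMMIT-style flag), and demo_collect_commit_lsegs() are hypothetical; locking and reference counting, which the real code would have to handle, are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum demo_iomode { DEMO_IOMODE_READ = 1, DEMO_IOMODE_RW = 2 };

/* Hypothetical, simplified layout segment. */
struct demo_lseg {
	enum demo_iomode iomode;
	int needs_commit;              /* stands in for a per-lseg commit flag */
	uint64_t end_pos;              /* last byte written via this segment */
	struct demo_lseg *next_commit; /* link used only on the commit list */
};

/* Result: every segment that needs committing, plus the largest end_pos. */
struct demo_commit_list {
	struct demo_lseg *head;
	size_t count;
	uint64_t max_end_pos;
};

/* Example data: only entries 1 and 3 are RW segments flagged for commit. */
static struct demo_lseg demo_segs[4] = {
	{ DEMO_IOMODE_READ, 0, 0,     NULL },
	{ DEMO_IOMODE_RW,   1, 4095,  NULL },
	{ DEMO_IOMODE_RW,   0, 99999, NULL },
	{ DEMO_IOMODE_RW,   1, 8191,  NULL },
};

/*
 * Collect all flagged RW segments instead of returning a single one.
 * Flags are cleared as segments are collected, so a second call finds
 * nothing. An empty result is valid and must be checked by the caller.
 */
static struct demo_commit_list
demo_collect_commit_lsegs(struct demo_lseg *segs, size_t n)
{
	struct demo_commit_list out = { NULL, 0, 0 };
	struct demo_lseg **tail = &out.head;

	assert(segs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		struct demo_lseg *s = &segs[i];

		if (s->iomode != DEMO_IOMODE_RW || !s->needs_commit)
			continue;
		s->needs_commit = 0;
		s->next_commit = NULL;
		*tail = s;
		tail = &s->next_commit;
		out.count++;
		if (s->end_pos > out.max_end_pos)
			out.max_end_pos = s->end_pos;
	}
	return out;
}
```

Because the result carries every affected segment, the caller can take and drop one reference per entry rather than guessing which single segment stands for the rest.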

Thanks,
Tao

> 
> Fred
> 
> > -       list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list)
> > -               if (lseg->pls_range.iomode == IOMODE_RW)
> > -                       rv = lseg;
> >        return rv;
> >  }
> >
> > @@ -1211,6 +1219,7 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
> >        if (!test_and_set_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) {
> >                /* references matched in nfs4_layoutcommit_release */
> >                get_lseg(wdata->lseg);
> > +               set_bit(NFS_LSEG_LAYOUTCOMMIT, &wdata->lseg->pls_flags);
> >                wdata->lseg->pls_lc_cred =
> >                        get_rpccred(wdata->args.context->state->owner->so_cred);
> >                mark_as_dirty = true;
> > diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> > index b071b56..a3fc0f2 100644
> > --- a/fs/nfs/pnfs.h
> > +++ b/fs/nfs/pnfs.h
> > @@ -36,6 +36,7 @@
> >  enum {
> >        NFS_LSEG_VALID = 0,     /* cleared when lseg is recalled/returned */
> >        NFS_LSEG_ROC,           /* roc bit received from server */
> > +       NFS_LSEG_LAYOUTCOMMIT,  /* layoutcommit bit set for layoutcommit */
> >  };
> >
> >  struct pnfs_layout_segment {
> > --
> > 1.7.4.1
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >



* RE: [PATCH 04/34] pnfs: hook nfs_write_begin/end to allow layout driver manipulation
  2011-06-13 14:44   ` Fred Isaman
@ 2011-06-14 11:01     ` tao.peng
  2011-06-14 14:05       ` Fred Isaman
  0 siblings, 1 reply; 58+ messages in thread
From: tao.peng @ 2011-06-14 11:01 UTC (permalink / raw)
  To: iisaman, rees; +Cc: linux-nfs, honey

Hi, Fred,

> -----Original Message-----
> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-owner@vger.kernel.org]
> On Behalf Of Fred Isaman
> Sent: Monday, June 13, 2011 10:44 PM
> To: Jim Rees
> Cc: linux-nfs@vger.kernel.org; peter honeyman
> Subject: Re: [PATCH 04/34] pnfs: hook nfs_write_begin/end to allow layout driver
> manipulation
>
> On Sun, Jun 12, 2011 at 7:43 PM, Jim Rees <rees@umich.edu> wrote:
> > From: Peng Tao <bergwolf@gmail.com>
> >
> > Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
> > Signed-off-by: Benny Halevy <bhalevy@panasas.com>
> > Reported-by: Alexandros Batsakis <batsakis@netapp.com>
> > Signed-off-by: Andy Adamson <andros@netapp.com>
> > Signed-off-by: Fred Isaman <iisaman@netapp.com>
> > Signed-off-by: Benny Halevy <bhalevy@panasas.com>
> > Signed-off-by: Peng Tao <bergwolf@gmail.com>
> > ---
> >  fs/nfs/file.c          |   26 ++++++++++-
> >  fs/nfs/pnfs.c          |   41 +++++++++++++++++
> >  fs/nfs/pnfs.h          |  115 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/nfs/write.c         |   12 +++--
> >  include/linux/nfs_fs.h |    3 +-
> >  5 files changed, 189 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> > index 2f093ed..1768762 100644
> > --- a/fs/nfs/file.c
> > +++ b/fs/nfs/file.c
> > @@ -384,12 +384,15 @@ static int nfs_write_begin(struct file *file, struct address_space *mapping,
> >        pgoff_t index = pos >> PAGE_CACHE_SHIFT;
> >        struct page *page;
> >        int once_thru = 0;
> > +       struct pnfs_layout_segment *lseg;
> >
> >        dfprintk(PAGECACHE, "NFS: write_begin(%s/%s(%ld), %u@%lld)\n",
> >                file->f_path.dentry->d_parent->d_name.name,
> >                file->f_path.dentry->d_name.name,
> >                mapping->host->i_ino, len, (long long) pos);
> > -
> > +       lseg = pnfs_update_layout(mapping->host,
> > +                                 nfs_file_open_context(file),
> > +                                 pos, len, IOMODE_RW, GFP_NOFS);
>
>
> This looks like it is left over from before the rearrangements done to
> where pnfs_update_layout is called.
> In particular, we don't want to hold the reference on the lseg from
> here until flush time.  And there
> seems to be no reason to.  If the client needs a layout to deal with
> read-in here, it should instead
> trigger the nfs_want_read_modify_write clause.
Yes, you are right. Calling pnfs_update_layout directly here can be avoided.
But it seems that triggering nfs_want_read_modify_write will acquire a read-only layout segment via the readpage code path.
For a write, the client needs a read-write layout segment, so this would mean two LAYOUTGETs for each new segment (one in nfs_readpage and one at flush time), which may hurt performance.
Does the current generic code have a way to avoid this?

Thanks,
Tao

>
> Fred
>
> >  start:
> >        /*
> >         * Prevent starvation issues if someone is doing a consistency
> > @@ -409,6 +412,9 @@ start:
> >        if (ret) {
> >                unlock_page(page);
> >                page_cache_release(page);
> > +               *pagep = NULL;
> > +               *fsdata = NULL;
> > +               goto out;
> >        } else if (!once_thru &&
> >                   nfs_want_read_modify_write(file, page, pos, len)) {
> >                once_thru = 1;
> > @@ -417,6 +423,12 @@ start:
> >                if (!ret)
> >                        goto start;
> >        }
> > +       ret = pnfs_write_begin(file, page, pos, len, lseg, fsdata);
> > + out:
> > +       if (ret) {
> > +               put_lseg(lseg);
> > +               *fsdata = NULL;
> > +       }
> >        return ret;
> >  }
> >
> > @@ -426,6 +438,7 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
> >  {
> >        unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
> >        int status;
> > +       struct pnfs_layout_segment *lseg;
> >
> >        dfprintk(PAGECACHE, "NFS: write_end(%s/%s(%ld), %u@%lld)\n",
> >                file->f_path.dentry->d_parent->d_name.name,
> > @@ -452,10 +465,17 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
> >                        zero_user_segment(page, pglen, PAGE_CACHE_SIZE);
> >        }
> >
> > -       status = nfs_updatepage(file, page, offset, copied);
> > +       lseg = nfs4_pull_lseg_from_fsdata(file, fsdata);
> > +       status = pnfs_write_end(file, page, pos, len, copied, lseg);
> > +       if (status)
> > +               goto out;
> > +       status = nfs_updatepage(file, page, offset, copied, lseg, fsdata);
> >
> > +out:
> >        unlock_page(page);
> >        page_cache_release(page);
> > +       pnfs_write_end_cleanup(file, fsdata);
> > +       put_lseg(lseg);
> >
> >        if (status < 0)
> >                return status;
> > @@ -577,7 +597,7 @@ static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> >
> >        ret = VM_FAULT_LOCKED;
> >        if (nfs_flush_incompatible(filp, page) == 0 &&
> > -           nfs_updatepage(filp, page, 0, pagelen) == 0)
> > +           nfs_updatepage(filp, page, 0, pagelen, NULL, NULL) == 0)
> >                goto out;
> >
> >        ret = VM_FAULT_SIGBUS;
> > diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> > index f03a5e0..e693718 100644
> > --- a/fs/nfs/pnfs.c
> > +++ b/fs/nfs/pnfs.c
> > @@ -1138,6 +1138,41 @@ pnfs_try_to_write_data(struct nfs_write_data *wdata,
> >  }
> >
> >  /*
> > + * This gives the layout driver an opportunity to read in page "around"
> > + * the data to be written.  It returns 0 on success, otherwise an error code
> > + * which will either be passed up to user, or ignored if
> > + * some previous part of write succeeded.
> > + * Note the range [pos, pos+len-1] is entirely within the page.
> > + */
> > +int _pnfs_write_begin(struct inode *inode, struct page *page,
> > +                     loff_t pos, unsigned len,
> > +                     struct pnfs_layout_segment *lseg,
> > +                     struct pnfs_fsdata **fsdata)
> > +{
> > +       struct pnfs_fsdata *data;
> > +       int status = 0;
> > +
> > +       dprintk("--> %s: pos=%llu len=%u\n",
> > +               __func__, (unsigned long long)pos, len);
> > +       data = kzalloc(sizeof(struct pnfs_fsdata), GFP_KERNEL);
> > +       if (!data) {
> > +               status = -ENOMEM;
> > +               goto out;
> > +       }
> > +       data->lseg = lseg; /* refcount passed into data to be managed there */
> > +       status = NFS_SERVER(inode)->pnfs_curr_ld->write_begin(
> > +                                               lseg, page, pos, len, data);
> > +       if (status) {
> > +               kfree(data);
> > +               data = NULL;
> > +       }
> > +out:
> > +       *fsdata = data;
> > +       dprintk("<-- %s: status=%d\n", __func__, status);
> > +       return status;
> > +}
> > +
> > +/*
> >  * Called by non rpc-based layout drivers
> >  */
> >  int
> > @@ -1237,6 +1272,12 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
> >  }
> >  EXPORT_SYMBOL_GPL(pnfs_set_layoutcommit);
> >
> > +void pnfs_free_fsdata(struct pnfs_fsdata *fsdata)
> > +{
> > +       /* lseg refcounting handled directly in nfs_write_end */
> > +       kfree(fsdata);
> > +}
> > +
> >  /*
> >  * For the LAYOUT4_NFSV4_1_FILES layout type, NFS_DATA_SYNC WRITEs and
> >  * NFS_UNSTABLE WRITEs with a COMMIT to data servers must store enough
> > diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> > index a3fc0f2..525ec55 100644
> > --- a/fs/nfs/pnfs.h
> > +++ b/fs/nfs/pnfs.h
> > @@ -54,6 +54,12 @@ enum pnfs_try_status {
> >        PNFS_NOT_ATTEMPTED = 1,
> >  };
> >
> > +struct pnfs_fsdata {
> > +       struct pnfs_layout_segment *lseg;
> > +       int bypass_eof;
> > +       void *private;
> > +};
> > +
> >  #ifdef CONFIG_NFS_V4_1
> >
> >  #define LAYOUT_NFSV4_1_MODULE_PREFIX "nfs-layouttype4"
> > @@ -106,6 +112,14 @@ struct pnfs_layoutdriver_type {
> >         */
> >        enum pnfs_try_status (*read_pagelist) (struct nfs_read_data *nfs_data);
> >        enum pnfs_try_status (*write_pagelist) (struct nfs_write_data *nfs_data, int how);
> > +       int (*write_begin) (struct pnfs_layout_segment *lseg, struct page *page,
> > +                           loff_t pos, unsigned count,
> > +                           struct pnfs_fsdata *fsdata);
> > +       int (*write_end)(struct inode *inode, struct page *page, loff_t pos,
> > +                        unsigned count, unsigned copied,
> > +                        struct pnfs_layout_segment *lseg);
> > +       void (*write_end_cleanup)(struct file *filp,
> > +                                 struct pnfs_fsdata *fsdata);
> >
> >        void (*free_deviceid_node) (struct nfs4_deviceid_node *);
> >
> > @@ -175,6 +189,7 @@ enum pnfs_try_status pnfs_try_to_write_data(struct nfs_write_data *,
> >  enum pnfs_try_status pnfs_try_to_read_data(struct nfs_read_data *,
> >                                            const struct rpc_call_ops *);
> >  bool pnfs_generic_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev, struct nfs_page *req);
> > +void pnfs_free_fsdata(struct pnfs_fsdata *fsdata);
> >  int pnfs_layout_process(struct nfs4_layoutget *lgp);
> >  void pnfs_free_lseg_list(struct list_head *tmp_list);
> >  void pnfs_destroy_layout(struct nfs_inode *);
> > @@ -186,6 +201,10 @@ void pnfs_set_layout_stateid(struct pnfs_layout_hdr *lo,
> >  int pnfs_choose_layoutget_stateid(nfs4_stateid *dst,
> >                                  struct pnfs_layout_hdr *lo,
> >                                  struct nfs4_state *open_state);
> > +int _pnfs_write_begin(struct inode *inode, struct page *page,
> > +                     loff_t pos, unsigned len,
> > +                     struct pnfs_layout_segment *lseg,
> > +                     struct pnfs_fsdata **fsdata);
> >  int mark_matching_lsegs_invalid(struct pnfs_layout_hdr *lo,
> >                                struct list_head *tmp_list,
> >                                struct pnfs_layout_range *recall_range);
> > @@ -287,6 +306,13 @@ static inline void pnfs_clear_request_commit(struct nfs_page *req)
> >                put_lseg(req->wb_commit_lseg);
> >  }
> >
> > +static inline int pnfs_grow_ok(struct pnfs_layout_segment *lseg,
> > +                              struct pnfs_fsdata *fsdata)
> > +{
> > +       return !fsdata  || ((struct pnfs_layout_segment *)fsdata == lseg) ||
> > +               !fsdata->bypass_eof;
> > +}
> > +
> >  /* Should the pNFS client commit and return the layout upon a setattr */
> >  static inline bool
> >  pnfs_ld_layoutret_on_setattr(struct inode *inode)
> > @@ -297,6 +323,49 @@ pnfs_ld_layoutret_on_setattr(struct inode *inode)
> >                PNFS_LAYOUTRET_ON_SETATTR;
> >  }
> >
> > +static inline int pnfs_write_begin(struct file *filp, struct page *page,
> > +                                  loff_t pos, unsigned len,
> > +                                  struct pnfs_layout_segment *lseg,
> > +                                  void **fsdata)
> > +{
> > +       struct inode *inode = filp->f_dentry->d_inode;
> > +       struct nfs_server *nfss = NFS_SERVER(inode);
> > +       int status = 0;
> > +
> > +       *fsdata = lseg;
> > +       if (lseg && nfss->pnfs_curr_ld->write_begin)
> > +               status = _pnfs_write_begin(inode, page, pos, len, lseg,
> > +                                          (struct pnfs_fsdata **) fsdata);
> > +       return status;
> > +}
> > +
> > +/* CAREFUL - what happens if copied < len??? */
> > +static inline int pnfs_write_end(struct file *filp, struct page *page,
> > +                                loff_t pos, unsigned len, unsigned copied,
> > +                                struct pnfs_layout_segment *lseg)
> > +{
> > +       struct inode *inode = filp->f_dentry->d_inode;
> > +       struct nfs_server *nfss = NFS_SERVER(inode);
> > +
> > +       if (nfss->pnfs_curr_ld && nfss->pnfs_curr_ld->write_end)
> > +               return nfss->pnfs_curr_ld->write_end(inode, page, pos, len,
> > +                                                    copied, lseg);
> > +       else
> > +               return 0;
> > +}
> > +
> > +static inline void pnfs_write_end_cleanup(struct file *filp, void *fsdata)
> > +{
> > +       struct nfs_server *nfss = NFS_SERVER(filp->f_dentry->d_inode);
> > +
> > +       if (fsdata && nfss->pnfs_curr_ld) {
> > +               if (nfss->pnfs_curr_ld->write_end_cleanup)
> > +                       nfss->pnfs_curr_ld->write_end_cleanup(filp, fsdata);
> > +               if (nfss->pnfs_curr_ld->write_begin)
> > +                       pnfs_free_fsdata(fsdata);
> > +       }
> > +}
> > +
> >  static inline int pnfs_return_layout(struct inode *ino)
> >  {
> >        struct nfs_inode *nfsi = NFS_I(ino);
> > @@ -317,6 +386,19 @@ static inline void pnfs_pageio_init(struct nfs_pageio_descriptor *pgio,
> >                pgio->pg_test = ld->pg_test;
> >  }
> >
> > +static inline struct pnfs_layout_segment *
> > +nfs4_pull_lseg_from_fsdata(struct file *filp, void *fsdata)
> > +{
> > +       if (fsdata) {
> > +               struct nfs_server *nfss = NFS_SERVER(filp->f_dentry->d_inode);
> > +
> > +               if (nfss->pnfs_curr_ld && nfss->pnfs_curr_ld->write_begin)
> > +                       return ((struct pnfs_fsdata *) fsdata)->lseg;
> > +               return (struct pnfs_layout_segment *)fsdata;
> > +       }
> > +       return NULL;
> > +}
> > +
> >  #else  /* CONFIG_NFS_V4_1 */
> >
> >  static inline void pnfs_destroy_all_layouts(struct nfs_client *clp)
> > @@ -345,6 +427,12 @@ pnfs_update_layout(struct inode *ino, struct nfs_open_context *ctx,
> >        return NULL;
> >  }
> >
> > +static inline int pnfs_grow_ok(struct pnfs_layout_segment *lseg,
> > +                              struct pnfs_fsdata *fsdata)
> > +{
> > +       return 1;
> > +}
> > +
> >  static inline enum pnfs_try_status
> >  pnfs_try_to_read_data(struct nfs_read_data *data,
> >                      const struct rpc_call_ops *call_ops)
> > @@ -364,6 +452,26 @@ static inline int pnfs_return_layout(struct inode *ino)
> >        return 0;
> >  }
> >
> > +static inline int pnfs_write_begin(struct file *filp, struct page *page,
> > +                                  loff_t pos, unsigned len,
> > +                                  struct pnfs_layout_segment *lseg,
> > +                                  void **fsdata)
> > +{
> > +       *fsdata = NULL;
> > +       return 0;
> > +}
> > +
> > +static inline int pnfs_write_end(struct file *filp, struct page *page,
> > +                                loff_t pos, unsigned len, unsigned copied,
> > +                                struct pnfs_layout_segment *lseg)
> > +{
> > +       return 0;
> > +}
> > +
> > +static inline void pnfs_write_end_cleanup(struct file *filp, void *fsdata)
> > +{
> > +}
> > +
> >  static inline bool
> >  pnfs_ld_layoutret_on_setattr(struct inode *inode)
> >  {
> > @@ -435,6 +543,13 @@ static inline int pnfs_layoutcommit_inode(struct inode *inode, bool sync)
> >  static inline void nfs4_deviceid_purge_client(struct nfs_client *ncl)
> >  {
> >  }
> > +
> > +static inline struct pnfs_layout_segment *
> > +nfs4_pull_lseg_from_fsdata(struct file *filp, void *fsdata)
> > +{
> > +       return NULL;
> > +}
> > +
> >  #endif /* CONFIG_NFS_V4_1 */
> >
> >  #endif /* FS_NFS_PNFS_H */
> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > index e268e3b..75e2a6b 100644
> > --- a/fs/nfs/write.c
> > +++ b/fs/nfs/write.c
> > @@ -673,7 +673,9 @@ out:
> >  }
> >
> >  static int nfs_writepage_setup(struct nfs_open_context *ctx, struct page *page,
> > -               unsigned int offset, unsigned int count)
> > +               unsigned int offset, unsigned int count,
> > +               struct pnfs_layout_segment *lseg, void *fsdata)
> > +
> >  {
> >        struct nfs_page *req;
> >
> > @@ -681,7 +683,8 @@ static int nfs_writepage_setup(struct nfs_open_context *ctx, struct page *page,
> >        if (IS_ERR(req))
> >                return PTR_ERR(req);
> >        /* Update file length */
> > -       nfs_grow_file(page, offset, count);
> > +       if (pnfs_grow_ok(lseg, fsdata))
> > +               nfs_grow_file(page, offset, count);
> >        nfs_mark_uptodate(page, req->wb_pgbase, req->wb_bytes);
> >        nfs_mark_request_dirty(req);
> >        nfs_clear_page_tag_locked(req);
> > @@ -734,7 +737,8 @@ static int nfs_write_pageuptodate(struct page *page, struct inode *inode)
> >  * things with a page scheduled for an RPC call (e.g. invalidate it).
> >  */
> >  int nfs_updatepage(struct file *file, struct page *page,
> > -               unsigned int offset, unsigned int count)
> > +               unsigned int offset, unsigned int count,
> > +               struct pnfs_layout_segment *lseg, void *fsdata)
> >  {
> >        struct nfs_open_context *ctx = nfs_file_open_context(file);
> >        struct inode    *inode = page->mapping->host;
> > @@ -759,7 +763,7 @@ int nfs_updatepage(struct file *file, struct page *page,
> >                offset = 0;
> >        }
> >
> > -       status = nfs_writepage_setup(ctx, page, offset, count);
> > +       status = nfs_writepage_setup(ctx, page, offset, count, lseg, fsdata);
> >        if (status < 0)
> >                nfs_set_pageerror(page);
> >
> > diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> > index 1b93b9c..be1ac1d 100644
> > --- a/include/linux/nfs_fs.h
> > +++ b/include/linux/nfs_fs.h
> > @@ -510,7 +510,8 @@ extern int  nfs_congestion_kb;
> >  extern int  nfs_writepage(struct page *page, struct writeback_control *wbc);
> >  extern int  nfs_writepages(struct address_space *, struct writeback_control *);
> >  extern int  nfs_flush_incompatible(struct file *file, struct page *page);
> > -extern int  nfs_updatepage(struct file *, struct page *, unsigned int, unsigned int);
> > +extern int  nfs_updatepage(struct file *, struct page *, unsigned int, unsigned int,
> > +                       struct pnfs_layout_segment *, void *);
> >  extern void nfs_writeback_done(struct rpc_task *, struct nfs_write_data *);
> >
> >  /*
> > --
> > 1.7.4.1
> >



* Re: [PATCH 03/34] pnfs: let layoutcommit code handle multiple segments
  2011-06-14 10:40     ` tao.peng
@ 2011-06-14 13:58       ` Fred Isaman
  2011-06-14 14:28       ` Benny Halevy
  1 sibling, 0 replies; 58+ messages in thread
From: Fred Isaman @ 2011-06-14 13:58 UTC (permalink / raw)
  To: tao.peng; +Cc: rees, linux-nfs, honey

On Tue, Jun 14, 2011 at 6:40 AM,  <tao.peng@emc.com> wrote:
> Hi, Fred,
>
>> -----Original Message-----
>> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-owner@vger.kernel.org]
>> On Behalf Of Fred Isaman
>> Sent: Monday, June 13, 2011 10:37 PM
>> To: Jim Rees
>> Cc: linux-nfs@vger.kernel.org; peter honeyman
>> Subject: Re: [PATCH 03/34] pnfs: let layoutcommit code handle multiple segments
>>
>> On Sun, Jun 12, 2011 at 7:43 PM, Jim Rees <rees@umich.edu> wrote:
>> > From: Peng Tao <bergwolf@gmail.com>
>> >
>> > Some layout driver like block will have multiple segments.
>> > Generic code should be able to handle it.
>> >
>> > Signed-off-by: Peng Tao <peng_tao@emc.com>
>> > ---
>> >  fs/nfs/pnfs.c |   17 +++++++++++++----
>> >  fs/nfs/pnfs.h |    1 +
>> >  2 files changed, 14 insertions(+), 4 deletions(-)
>> >
>> > diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>> > index e3d618b..f03a5e0 100644
>> > --- a/fs/nfs/pnfs.c
>> > +++ b/fs/nfs/pnfs.c
>> > @@ -892,7 +892,7 @@ pnfs_find_lseg(struct pnfs_layout_hdr *lo,
>> >        dprintk("%s:Begin\n", __func__);
>> >
>> >        assert_spin_locked(&lo->plh_inode->i_lock);
>> > -       list_for_each_entry(lseg, &lo->plh_segs, pls_list) {
>> > +       list_for_each_entry_reverse(lseg, &lo->plh_segs, pls_list) {
>>
>> This is a sorted list, and the order of search matters.  You can't
>> just reverse it here.
> The layout segment list is sorted by increasing offset, but the lookup code here assumes it is sorted in decreasing order.
> To fix it, we should either walk the list in reverse or change the break condition. Otherwise the lookup always fails unless the first segment matches.

I agree there is a problem here that affects the generic code.  I've
just sent a separate patch that deals with that.

Fred

>
>>
>> >                if (test_bit(NFS_LSEG_VALID, &lseg->pls_flags) &&
>> >                    is_matching_lseg(&lseg->pls_range, range)) {
>> >                        ret = get_lseg(lseg);
>> > @@ -1193,10 +1193,18 @@ pnfs_try_to_read_data(struct nfs_read_data *rdata,
>> >  static struct pnfs_layout_segment *pnfs_list_write_lseg(struct inode *inode)
>> >  {
>> >        struct pnfs_layout_segment *lseg, *rv = NULL;
>> > +       loff_t max_pos = 0;
>> > +
>> > +       list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list) {
>> > +               if (lseg->pls_range.iomode == IOMODE_RW) {
>> > +                       if (max_pos < lseg->pls_end_pos)
>> > +                               max_pos = lseg->pls_end_pos;
>> > +                       if (test_and_clear_bit(NFS_LSEG_LAYOUTCOMMIT, &lseg->pls_flags))
>> > +                               rv = lseg;
>> > +               }
>> > +       }
>> > +       rv->pls_end_pos = max_pos;
>> >
>>
>> The idea here was that it could be extended to use segments by
>> returning a list of affected lsegs, not some random one.  Otherwise
>> you have problems, because relevant but not returned lsegs are going
>> to get their refcounts messed up.
> The above code relies on the NFS_INO_LAYOUTCOMMIT bit to ensure that only one lseg per inode has NFS_LSEG_LAYOUTCOMMIT set. But you are right: the layoutcommit code needs a second look.
> How about making it return a list of affected lsegs and passing that list around the layoutcommit_procs?
>
> Thanks,
> Tao
>
>>
>> Fred
>>
>> > -       list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list)
>> > -               if (lseg->pls_range.iomode == IOMODE_RW)
>> > -                       rv = lseg;
>> >        return rv;
>> >  }
>> >
>> > @@ -1211,6 +1219,7 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
>> >        if (!test_and_set_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) {
>> >                /* references matched in nfs4_layoutcommit_release */
>> >                get_lseg(wdata->lseg);
>> > +               set_bit(NFS_LSEG_LAYOUTCOMMIT, &wdata->lseg->pls_flags);
>> >                wdata->lseg->pls_lc_cred =
>> >                        get_rpccred(wdata->args.context->state->owner->so_cred);
>> >                mark_as_dirty = true;
>> > diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
>> > index b071b56..a3fc0f2 100644
>> > --- a/fs/nfs/pnfs.h
>> > +++ b/fs/nfs/pnfs.h
>> > @@ -36,6 +36,7 @@
>> >  enum {
>> >        NFS_LSEG_VALID = 0,     /* cleared when lseg is recalled/returned */
>> >        NFS_LSEG_ROC,           /* roc bit received from server */
>> > +       NFS_LSEG_LAYOUTCOMMIT,  /* layoutcommit bit set for layoutcommit */
>> >  };
>> >
>> >  struct pnfs_layout_segment {
>> > --
>> > 1.7.4.1
>> >


* Re: [PATCH 04/34] pnfs: hook nfs_write_begin/end to allow layout driver manipulation
  2011-06-14 11:01     ` tao.peng
@ 2011-06-14 14:05       ` Fred Isaman
  2011-06-14 15:53         ` Peng Tao
  0 siblings, 1 reply; 58+ messages in thread
From: Fred Isaman @ 2011-06-14 14:05 UTC (permalink / raw)
  To: tao.peng; +Cc: rees, linux-nfs, honey

On Tue, Jun 14, 2011 at 7:01 AM,  <tao.peng@emc.com> wrote:
> Hi, Fred,
>
>> -----Original Message-----
>> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-owner@vger.kernel.org]
>> On Behalf Of Fred Isaman
>> Sent: Monday, June 13, 2011 10:44 PM
>> To: Jim Rees
>> Cc: linux-nfs@vger.kernel.org; peter honeyman
>> Subject: Re: [PATCH 04/34] pnfs: hook nfs_write_begin/end to allow layout driver
>> manipulation
>>
>> On Sun, Jun 12, 2011 at 7:43 PM, Jim Rees <rees@umich.edu> wrote:
>> > From: Peng Tao <bergwolf@gmail.com>
>> >
>> > Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
>> > Signed-off-by: Benny Halevy <bhalevy@panasas.com>
>> > Reported-by: Alexandros Batsakis <batsakis@netapp.com>
>> > Signed-off-by: Andy Adamson <andros@netapp.com>
>> > Signed-off-by: Fred Isaman <iisaman@netapp.com>
>> > Signed-off-by: Benny Halevy <bhalevy@panasas.com>
>> > Signed-off-by: Peng Tao <bergwolf@gmail.com>
>> > ---
>> >  fs/nfs/file.c          |   26 ++++++++++-
>> >  fs/nfs/pnfs.c          |   41 +++++++++++++++++
>> >  fs/nfs/pnfs.h          |  115 ++++++++++++++++++++++++++++++++++++++++++++++++
>> >  fs/nfs/write.c         |   12 +++--
>> >  include/linux/nfs_fs.h |    3 +-
>> >  5 files changed, 189 insertions(+), 8 deletions(-)
>> >
>> > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
>> > index 2f093ed..1768762 100644
>> > --- a/fs/nfs/file.c
>> > +++ b/fs/nfs/file.c
>> > @@ -384,12 +384,15 @@ static int nfs_write_begin(struct file *file, struct address_space *mapping,
>> >        pgoff_t index = pos >> PAGE_CACHE_SHIFT;
>> >        struct page *page;
>> >        int once_thru = 0;
>> > +       struct pnfs_layout_segment *lseg;
>> >
>> >        dfprintk(PAGECACHE, "NFS: write_begin(%s/%s(%ld), %u@%lld)\n",
>> >                file->f_path.dentry->d_parent->d_name.name,
>> >                file->f_path.dentry->d_name.name,
>> >                mapping->host->i_ino, len, (long long) pos);
>> > -
>> > +       lseg = pnfs_update_layout(mapping->host,
>> > +                                 nfs_file_open_context(file),
>> > +                                 pos, len, IOMODE_RW, GFP_NOFS);
>>
>>
>> This looks like it is left over from before the rearrangements done to
>> where pnfs_update_layout is called.
>> In particular, we don't want to hold the reference on the lseg from
>> here until flush time.  And there
>> seems to be no reason to.  If the client needs a layout to deal with
>> read-in here, it should instead
>> trigger the nfs_want_read_modify_write clause.
> Yes, you are right. Calling pnfs_update_layout directly here can be avoided.
> But it seems that triggering nfs_want_read_modify_write will acquire a read-only layout segment via the readpage code path.
> For a write, the client needs a read-write layout segment, so this would mean two LAYOUTGETs for each new segment (one in nfs_readpage and one at flush time), which may hurt performance.
> Does the current generic code have a way to avoid this?
>
> Thanks,
> Tao
>

No.  However, note that this only hits in the case where you are doing
subpage writes.
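
To make the "subpage writes only" point concrete, here is a simplified user-space model. It is a sketch under stated assumptions: the page size is fixed at 4096, the segment is not yet cached, and the only conditions considered are whether the page is already up to date and whether the write covers the whole page. The kernel's nfs_want_read_modify_write checks more than this. The demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u

/* True when [pos, pos+len) covers the entire page it starts in. */
static bool demo_is_full_page_write(uint64_t pos, unsigned int len)
{
	unsigned int offset = (unsigned int)(pos & (DEMO_PAGE_SIZE - 1));

	/* write_begin ranges are assumed to stay within a single page */
	assert(len <= DEMO_PAGE_SIZE - offset);
	return offset == 0 && len == DEMO_PAGE_SIZE;
}

/*
 * Number of layout requests this model predicts for one page write:
 * always one read-write request at flush time, plus one read-only
 * request if the page must first be read in (subpage write to a page
 * that is not up to date).
 */
static unsigned int demo_layoutgets_for_write(uint64_t pos, unsigned int len,
					      bool page_uptodate)
{
	unsigned int count = 1; /* read-write layout at flush time */

	if (!page_uptodate && !demo_is_full_page_write(pos, len))
		count++;        /* read-only layout for the read-in */
	return count;
}
```

In this model a full-page write, or any write to a page that is already up to date, costs a single request; only a partial write to a page that still has to be read in pays for two.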

Fred


* Re: [PATCH 03/34] pnfs: let layoutcommit code handle multiple segments
  2011-06-14 10:40     ` tao.peng
  2011-06-14 13:58       ` Fred Isaman
@ 2011-06-14 14:28       ` Benny Halevy
  1 sibling, 0 replies; 58+ messages in thread
From: Benny Halevy @ 2011-06-14 14:28 UTC (permalink / raw)
  To: tao.peng; +Cc: iisaman, rees, linux-nfs, honey

On 2011-06-14 06:40, tao.peng@emc.com wrote:
> Hi, Fred,
> 
>> -----Original Message-----
>> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-owner@vger.kernel.org]
>> On Behalf Of Fred Isaman
>> Sent: Monday, June 13, 2011 10:37 PM
>> To: Jim Rees
>> Cc: linux-nfs@vger.kernel.org; peter honeyman
>> Subject: Re: [PATCH 03/34] pnfs: let layoutcommit code handle multiple segments
>>
>> On Sun, Jun 12, 2011 at 7:43 PM, Jim Rees <rees@umich.edu> wrote:
>>> From: Peng Tao <bergwolf@gmail.com>
>>>
>>> Some layout driver like block will have multiple segments.
>>> Generic code should be able to handle it.
>>>
>>> Signed-off-by: Peng Tao <peng_tao@emc.com>
>>> ---
>>>  fs/nfs/pnfs.c |   17 +++++++++++++----
>>>  fs/nfs/pnfs.h |    1 +
>>>  2 files changed, 14 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>>> index e3d618b..f03a5e0 100644
>>> --- a/fs/nfs/pnfs.c
>>> +++ b/fs/nfs/pnfs.c
>>> @@ -892,7 +892,7 @@ pnfs_find_lseg(struct pnfs_layout_hdr *lo,
>>>        dprintk("%s:Begin\n", __func__);
>>>
>>>        assert_spin_locked(&lo->plh_inode->i_lock);
>>> -       list_for_each_entry(lseg, &lo->plh_segs, pls_list) {
>>> +       list_for_each_entry_reverse(lseg, &lo->plh_segs, pls_list) {
>>
>> This is a sorted list, and the order of search matters.  You can't
>> just reverse it here.
> The layout segment list is sorted by increasing offset, but the lookup code here assumes decreasing order.
> To fix it, we should either walk the list in reverse or change the break condition. Otherwise the lookup always fails unless the first segment matches.
> 

We shouldn't scan the list in reverse.
I'll send a fix upstream to fix the break condition.
This got broken when I last changed cmp_layout.
Basically we want to break out of the loop once we can't find a layout covering the first
byte of the range we're looking up.
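
As a rough illustration of that break condition (not the actual fix), here is
a minimal userspace sketch.  It assumes a plain array sorted by increasing
offset instead of the kernel list, and it ignores iomode entirely.  The names
seg_range, seg_covers and find_seg are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a layout segment range. */
struct seg_range {
	uint64_t offset;	/* first byte covered */
	uint64_t length;	/* number of bytes covered */
};

/* Return nonzero if seg covers the byte at position pos. */
static int seg_covers(const struct seg_range *seg, uint64_t pos)
{
	return seg->offset <= pos && pos - seg->offset < seg->length;
}

/*
 * segs[] is sorted by increasing offset.  Return the index of the first
 * segment covering byte 'start', or -1 if no segment covers it.
 */
static int find_seg(const struct seg_range *segs, size_t nsegs,
		    uint64_t start)
{
	size_t i;

	for (i = 0; i < nsegs; i++) {
		if (seg_covers(&segs[i], start))
			return (int)i;
		/* All later segments start even further right: stop here. */
		if (segs[i].offset > start)
			break;
	}
	return -1;
}
```

With segments at offsets 0, 4096 and 16384 (4096 bytes each), a lookup at
byte 9000 walks the first two entries, sees that the third starts past 9000,
and stops without matching.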

Benny

>>
>>>                if (test_bit(NFS_LSEG_VALID, &lseg->pls_flags) &&
>>>                    is_matching_lseg(&lseg->pls_range, range)) {
>>>                        ret = get_lseg(lseg);
>>> @@ -1193,10 +1193,18 @@ pnfs_try_to_read_data(struct nfs_read_data *rdata,
>>>  static struct pnfs_layout_segment *pnfs_list_write_lseg(struct inode *inode)
>>>  {
>>>        struct pnfs_layout_segment *lseg, *rv = NULL;
>>> +       loff_t max_pos = 0;
>>> +
>>> +       list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list) {
>>> +               if (lseg->pls_range.iomode == IOMODE_RW) {
>>> +                       if (max_pos < lseg->pls_end_pos)
>>> +                               max_pos = lseg->pls_end_pos;
>>> +                       if (test_and_clear_bit(NFS_LSEG_LAYOUTCOMMIT, &lseg->pls_flags))
>>> +                               rv = lseg;
>>> +               }
>>> +       }
>>> +       rv->pls_end_pos = max_pos;
>>>
>>
>> The idea here was that it could be extended to use segments by
>> returning a list of affected lsegs,
>> not some random one.  Because otherwise you have problems with the
>> fact that relevant but not
>> returned lsegs are going to get their refcounts messed up.
> The above code relies on NFS_INO_LAYOUTCOMMIT bit to ensure that only one inode lseg has NFS_LSEG_LAYOUTCOMMIT set. But, you are right. The layoutcommit code needs a second thought.
> How about making it return a list of affected lsegs and pass them around layoutcommit_procs?
> 
> Thanks,
> Tao
> 
>>
>> Fred
>>
>>> -       list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list)
>>> -               if (lseg->pls_range.iomode == IOMODE_RW)
>>> -                       rv = lseg;
>>>        return rv;
>>>  }
>>>
>>> @@ -1211,6 +1219,7 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
>>>        if (!test_and_set_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) {
>>>                /* references matched in nfs4_layoutcommit_release */
>>>                get_lseg(wdata->lseg);
>>> +               set_bit(NFS_LSEG_LAYOUTCOMMIT, &wdata->lseg->pls_flags);
>>>                wdata->lseg->pls_lc_cred =
>>>                        get_rpccred(wdata->args.context->state->owner->so_cred);
>>>                mark_as_dirty = true;
>>> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
>>> index b071b56..a3fc0f2 100644
>>> --- a/fs/nfs/pnfs.h
>>> +++ b/fs/nfs/pnfs.h
>>> @@ -36,6 +36,7 @@
>>>  enum {
>>>        NFS_LSEG_VALID = 0,     /* cleared when lseg is recalled/returned */
>>>        NFS_LSEG_ROC,           /* roc bit received from server */
>>> +       NFS_LSEG_LAYOUTCOMMIT,  /* layoutcommit bit set for layoutcommit */
>>>  };
>>>
>>>  struct pnfs_layout_segment {
>>> --
>>> 1.7.4.1
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>


* Re: [PATCH 05/34] pnfs: ask for layout_blksize and save it in nfs_server
  2011-06-12 23:43 ` [PATCH 05/34] pnfs: ask for layout_blksize and save it in nfs_server Jim Rees
@ 2011-06-14 15:01   ` Benny Halevy
  2011-06-14 15:08     ` Peng Tao
  0 siblings, 1 reply; 58+ messages in thread
From: Benny Halevy @ 2011-06-14 15:01 UTC (permalink / raw)
  To: Jim Rees; +Cc: linux-nfs, peter honeyman

On 2011-06-12 19:43, Jim Rees wrote:
> From: Peng Tao <bergwolf@gmail.com>

Jim, please revise the authorship of the different patches.
This one, for example, was originally authored by
Fred Isaman <iisaman@citi.umich.edu> (see commit 5c5a76f)

Benny

> 
> Block layout needs it to determine IO size.
> 
> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
> Signed-off-by: Tao Guo <glorioustao@gmail.com>
> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
> Signed-off-by: Peng Tao <bergwolf@gmail.com>
> ---
>  fs/nfs/client.c           |    1 +
>  fs/nfs/nfs4_fs.h          |    2 +-
>  fs/nfs/nfs4proc.c         |    5 +-
>  fs/nfs/nfs4xdr.c          |  101 +++++++++++++++++++++++++++++++++++++--------
>  include/linux/nfs_fs_sb.h |    4 +-
>  include/linux/nfs_xdr.h   |    3 +-
>  6 files changed, 93 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/nfs/client.c b/fs/nfs/client.c
> index 6bdb7da0..b2c6920 100644
> --- a/fs/nfs/client.c
> +++ b/fs/nfs/client.c
> @@ -937,6 +937,7 @@ static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fh *mntf
>  	if (server->wsize > NFS_MAX_FILE_IO_SIZE)
>  		server->wsize = NFS_MAX_FILE_IO_SIZE;
>  	server->wpages = (server->wsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> +	server->pnfs_blksize = fsinfo->blksize;
>  	set_pnfs_layoutdriver(server, mntfh, fsinfo->layouttype);
>  
>  	server->wtmult = nfs_block_bits(fsinfo->wtmult, NULL);
> diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
> index c4a6983..5725a7e 100644
> --- a/fs/nfs/nfs4_fs.h
> +++ b/fs/nfs/nfs4_fs.h
> @@ -315,7 +315,7 @@ extern const struct nfs4_minor_version_ops *nfs_v4_minor_ops[];
>  extern const u32 nfs4_fattr_bitmap[2];
>  extern const u32 nfs4_statfs_bitmap[2];
>  extern const u32 nfs4_pathconf_bitmap[2];
> -extern const u32 nfs4_fsinfo_bitmap[2];
> +extern const u32 nfs4_fsinfo_bitmap[3];
>  extern const u32 nfs4_fs_locations_bitmap[2];
>  
>  /* nfs4renewd.c */
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index 4a5ad93..5246db8 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -137,12 +137,13 @@ const u32 nfs4_pathconf_bitmap[2] = {
>  	0
>  };
>  
> -const u32 nfs4_fsinfo_bitmap[2] = { FATTR4_WORD0_MAXFILESIZE
> +const u32 nfs4_fsinfo_bitmap[3] = { FATTR4_WORD0_MAXFILESIZE
>  			| FATTR4_WORD0_MAXREAD
>  			| FATTR4_WORD0_MAXWRITE
>  			| FATTR4_WORD0_LEASE_TIME,
>  			FATTR4_WORD1_TIME_DELTA
> -			| FATTR4_WORD1_FS_LAYOUT_TYPES
> +			| FATTR4_WORD1_FS_LAYOUT_TYPES,
> +			FATTR4_WORD2_LAYOUT_BLKSIZE
>  };
>  
>  const u32 nfs4_fs_locations_bitmap[2] = {
> diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
> index 3620c45..fdcbd8f 100644
> --- a/fs/nfs/nfs4xdr.c
> +++ b/fs/nfs/nfs4xdr.c
> @@ -91,7 +91,7 @@ static int nfs4_stat_to_errno(int);
>  #define encode_getfh_maxsz      (op_encode_hdr_maxsz)
>  #define decode_getfh_maxsz      (op_decode_hdr_maxsz + 1 + \
>  				((3+NFS4_FHSIZE) >> 2))
> -#define nfs4_fattr_bitmap_maxsz 3
> +#define nfs4_fattr_bitmap_maxsz 4
>  #define encode_getattr_maxsz    (op_encode_hdr_maxsz + nfs4_fattr_bitmap_maxsz)
>  #define nfs4_name_maxsz		(1 + ((3 + NFS4_MAXNAMLEN) >> 2))
>  #define nfs4_path_maxsz		(1 + ((3 + NFS4_MAXPATHLEN) >> 2))
> @@ -113,7 +113,11 @@ static int nfs4_stat_to_errno(int);
>  #define encode_restorefh_maxsz  (op_encode_hdr_maxsz)
>  #define decode_restorefh_maxsz  (op_decode_hdr_maxsz)
>  #define encode_fsinfo_maxsz	(encode_getattr_maxsz)
> -#define decode_fsinfo_maxsz	(op_decode_hdr_maxsz + 15)
> +/* The 5 accounts for the PNFS attributes, and assumes that at most three
> + * layout types will be returned.
> + */
> +#define decode_fsinfo_maxsz	(op_decode_hdr_maxsz + \
> +				 nfs4_fattr_bitmap_maxsz + 4 + 8 + 5)
>  #define encode_renew_maxsz	(op_encode_hdr_maxsz + 3)
>  #define decode_renew_maxsz	(op_decode_hdr_maxsz)
>  #define encode_setclientid_maxsz \
> @@ -1095,6 +1099,35 @@ static void encode_getattr_two(struct xdr_stream *xdr, uint32_t bm0, uint32_t bm
>  	hdr->replen += decode_getattr_maxsz;
>  }
>  
> +static void
> +encode_getattr_three(struct xdr_stream *xdr,
> +		     uint32_t bm0, uint32_t bm1, uint32_t bm2,
> +		     struct compound_hdr *hdr)
> +{
> +	__be32 *p;
> +
> +	p = reserve_space(xdr, 4);
> +	*p = cpu_to_be32(OP_GETATTR);
> +	if (bm2) {
> +		p = reserve_space(xdr, 16);
> +		*p++ = cpu_to_be32(3);
> +		*p++ = cpu_to_be32(bm0);
> +		*p++ = cpu_to_be32(bm1);
> +		*p = cpu_to_be32(bm2);
> +	} else if (bm1) {
> +		p = reserve_space(xdr, 12);
> +		*p++ = cpu_to_be32(2);
> +		*p++ = cpu_to_be32(bm0);
> +		*p = cpu_to_be32(bm1);
> +	} else {
> +		p = reserve_space(xdr, 8);
> +		*p++ = cpu_to_be32(1);
> +		*p = cpu_to_be32(bm0);
> +	}
> +	hdr->nops++;
> +	hdr->replen += decode_getattr_maxsz;
> +}
> +
>  static void encode_getfattr(struct xdr_stream *xdr, const u32* bitmask, struct compound_hdr *hdr)
>  {
>  	encode_getattr_two(xdr, bitmask[0] & nfs4_fattr_bitmap[0],
> @@ -1103,8 +1136,11 @@ static void encode_getfattr(struct xdr_stream *xdr, const u32* bitmask, struct c
>  
>  static void encode_fsinfo(struct xdr_stream *xdr, const u32* bitmask, struct compound_hdr *hdr)
>  {
> -	encode_getattr_two(xdr, bitmask[0] & nfs4_fsinfo_bitmap[0],
> -			   bitmask[1] & nfs4_fsinfo_bitmap[1], hdr);
> +	encode_getattr_three(xdr,
> +			     bitmask[0] & nfs4_fsinfo_bitmap[0],
> +			     bitmask[1] & nfs4_fsinfo_bitmap[1],
> +			     bitmask[2] & nfs4_fsinfo_bitmap[2],
> +			     hdr);
>  }
>  
>  static void encode_fs_locations(struct xdr_stream *xdr, const u32* bitmask, struct compound_hdr *hdr)
> @@ -2575,7 +2611,7 @@ static void nfs4_xdr_enc_setclientid_confirm(struct rpc_rqst *req,
>  	struct compound_hdr hdr = {
>  		.nops	= 0,
>  	};
> -	const u32 lease_bitmap[2] = { FATTR4_WORD0_LEASE_TIME, 0 };
> +	const u32 lease_bitmap[3] = { FATTR4_WORD0_LEASE_TIME, 0, 0 };
>  
>  	encode_compound_hdr(xdr, req, &hdr);
>  	encode_setclientid_confirm(xdr, arg, &hdr);
> @@ -2719,7 +2755,7 @@ static void nfs4_xdr_enc_get_lease_time(struct rpc_rqst *req,
>  	struct compound_hdr hdr = {
>  		.minorversion = nfs4_xdr_minorversion(&args->la_seq_args),
>  	};
> -	const u32 lease_bitmap[2] = { FATTR4_WORD0_LEASE_TIME, 0 };
> +	const u32 lease_bitmap[3] = { FATTR4_WORD0_LEASE_TIME, 0, 0 };
>  
>  	encode_compound_hdr(xdr, req, &hdr);
>  	encode_sequence(xdr, &args->la_seq_args, &hdr);
> @@ -2947,14 +2983,17 @@ static int decode_attr_bitmap(struct xdr_stream *xdr, uint32_t *bitmap)
>  		goto out_overflow;
>  	bmlen = be32_to_cpup(p);
>  
> -	bitmap[0] = bitmap[1] = 0;
> +	bitmap[0] = bitmap[1] = bitmap[2] = 0;
>  	p = xdr_inline_decode(xdr, (bmlen << 2));
>  	if (unlikely(!p))
>  		goto out_overflow;
>  	if (bmlen > 0) {
>  		bitmap[0] = be32_to_cpup(p++);
> -		if (bmlen > 1)
> -			bitmap[1] = be32_to_cpup(p);
> +		if (bmlen > 1) {
> +			bitmap[1] = be32_to_cpup(p++);
> +			if (bmlen > 2)
> +				bitmap[2] = be32_to_cpup(p);
> +		}
>  	}
>  	return 0;
>  out_overflow:
> @@ -2986,8 +3025,9 @@ static int decode_attr_supported(struct xdr_stream *xdr, uint32_t *bitmap, uint3
>  			return ret;
>  		bitmap[0] &= ~FATTR4_WORD0_SUPPORTED_ATTRS;
>  	} else
> -		bitmask[0] = bitmask[1] = 0;
> -	dprintk("%s: bitmask=%08x:%08x\n", __func__, bitmask[0], bitmask[1]);
> +		bitmask[0] = bitmask[1] = bitmask[2] = 0;
> +	dprintk("%s: bitmask=%08x:%08x:%08x\n", __func__,
> +		bitmask[0], bitmask[1], bitmask[2]);
>  	return 0;
>  }
>  
> @@ -4041,7 +4081,7 @@ out_overflow:
>  static int decode_server_caps(struct xdr_stream *xdr, struct nfs4_server_caps_res *res)
>  {
>  	__be32 *savep;
> -	uint32_t attrlen, bitmap[2] = {0};
> +	uint32_t attrlen, bitmap[3] = {0};
>  	int status;
>  
>  	if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
> @@ -4067,7 +4107,7 @@ xdr_error:
>  static int decode_statfs(struct xdr_stream *xdr, struct nfs_fsstat *fsstat)
>  {
>  	__be32 *savep;
> -	uint32_t attrlen, bitmap[2] = {0};
> +	uint32_t attrlen, bitmap[3] = {0};
>  	int status;
>  
>  	if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
> @@ -4099,7 +4139,7 @@ xdr_error:
>  static int decode_pathconf(struct xdr_stream *xdr, struct nfs_pathconf *pathconf)
>  {
>  	__be32 *savep;
> -	uint32_t attrlen, bitmap[2] = {0};
> +	uint32_t attrlen, bitmap[3] = {0};
>  	int status;
>  
>  	if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
> @@ -4239,7 +4279,7 @@ static int decode_getfattr_generic(struct xdr_stream *xdr, struct nfs_fattr *fat
>  {
>  	__be32 *savep;
>  	uint32_t attrlen,
> -		 bitmap[2] = {0};
> +		 bitmap[3] = {0};
>  	int status;
>  
>  	status = decode_op_hdr(xdr, OP_GETATTR);
> @@ -4325,10 +4365,32 @@ static int decode_attr_pnfstype(struct xdr_stream *xdr, uint32_t *bitmap,
>  	return status;
>  }
>  
> +/*
> + * The preferred block size for layout directed io
> + */
> +static int decode_attr_layout_blksize(struct xdr_stream *xdr, uint32_t *bitmap,
> +				      uint32_t *res)
> +{
> +	__be32 *p;
> +
> +	dprintk("%s: bitmap is %x\n", __func__, bitmap[2]);
> +	*res = 0;
> +	if (bitmap[2] & FATTR4_WORD2_LAYOUT_BLKSIZE) {
> +		p = xdr_inline_decode(xdr, 4);
> +		if (unlikely(!p)) {
> +			print_overflow_msg(__func__, xdr);
> +			return -EIO;
> +		}
> +		*res = be32_to_cpup(p);
> +		bitmap[2] &= ~FATTR4_WORD2_LAYOUT_BLKSIZE;
> +	}
> +	return 0;
> +}
> +
>  static int decode_fsinfo(struct xdr_stream *xdr, struct nfs_fsinfo *fsinfo)
>  {
>  	__be32 *savep;
> -	uint32_t attrlen, bitmap[2];
> +	uint32_t attrlen, bitmap[3];
>  	int status;
>  
>  	if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
> @@ -4356,6 +4418,9 @@ static int decode_fsinfo(struct xdr_stream *xdr, struct nfs_fsinfo *fsinfo)
>  	status = decode_attr_pnfstype(xdr, bitmap, &fsinfo->layouttype);
>  	if (status != 0)
>  		goto xdr_error;
> +	status = decode_attr_layout_blksize(xdr, bitmap, &fsinfo->blksize);
> +	if (status)
> +		goto xdr_error;
>  
>  	status = verify_attr_len(xdr, savep, attrlen);
>  xdr_error:
> @@ -4775,7 +4840,7 @@ static int decode_getacl(struct xdr_stream *xdr, struct rpc_rqst *req,
>  {
>  	__be32 *savep;
>  	uint32_t attrlen,
> -		 bitmap[2] = {0};
> +		 bitmap[3] = {0};
>  	struct kvec *iov = req->rq_rcv_buf.head;
>  	int status;
>  
> @@ -6605,7 +6670,7 @@ out:
>  int nfs4_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry,
>  		       int plus)
>  {
> -	uint32_t bitmap[2] = {0};
> +	uint32_t bitmap[3] = {0};
>  	uint32_t len;
>  	__be32 *p = xdr_inline_decode(xdr, 4);
>  	if (unlikely(!p))
> diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
> index 87694ca..79cc4ca 100644
> --- a/include/linux/nfs_fs_sb.h
> +++ b/include/linux/nfs_fs_sb.h
> @@ -130,7 +130,7 @@ struct nfs_server {
>  #endif
>  
>  #ifdef CONFIG_NFS_V4
> -	u32			attr_bitmask[2];/* V4 bitmask representing the set
> +	u32			attr_bitmask[3];/* V4 bitmask representing the set
>  						   of attributes supported on this
>  						   filesystem */
>  	u32			cache_consistency_bitmask[2];
> @@ -143,6 +143,8 @@ struct nfs_server {
>  						   filesystem */
>  	struct pnfs_layoutdriver_type  *pnfs_curr_ld; /* Active layout driver */
>  	struct rpc_wait_queue	roc_rpcwaitq;
> +	void			*pnfs_ld_data; /* per mount point data */
> +	u32			pnfs_blksize; /* layout_blksize attr */
>  
>  	/* the following fields are protected by nfs_client->cl_lock */
>  	struct rb_root		state_owners;
> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
> index 00442f5..a9c43ba 100644
> --- a/include/linux/nfs_xdr.h
> +++ b/include/linux/nfs_xdr.h
> @@ -122,6 +122,7 @@ struct nfs_fsinfo {
>  	struct timespec		time_delta; /* server time granularity */
>  	__u32			lease_time; /* in seconds */
>  	__u32			layouttype; /* supported pnfs layout driver */
> +	__u32			blksize; /* preferred pnfs io block size */
>  };
>  
>  struct nfs_fsstat {
> @@ -954,7 +955,7 @@ struct nfs4_server_caps_arg {
>  };
>  
>  struct nfs4_server_caps_res {
> -	u32				attr_bitmask[2];
> +	u32				attr_bitmask[3];
>  	u32				acl_bitmask;
>  	u32				has_links;
>  	u32				has_symlinks;
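
For reference, the three-way branch in encode_getattr_three above amounts to
"send the bitmap with trailing zero words dropped, but always send at least
one word".  A minimal userspace sketch of that idea follows.  It writes into
a plain byte buffer rather than an xdr_stream, leaves out the opcode word,
and the helpers put_be32 and encode_bitmap are hypothetical names for this
example only.

```c
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value in big-endian (XDR) byte order. */
static void put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/*
 * Encode an attribute bitmap as a count word followed by that many
 * bitmap words.  Trailing zero words are dropped, but word 0 is always
 * sent.  Requires nwords >= 1 and room for 4 * (nwords + 1) bytes in
 * out.  Returns the number of bytes written.
 */
static size_t encode_bitmap(uint8_t *out, const uint32_t *bm, size_t nwords)
{
	size_t n = nwords, i;

	/* Drop trailing zero words, keeping at least word 0. */
	while (n > 1 && bm[n - 1] == 0)
		n--;

	put_be32(out, (uint32_t)n);
	for (i = 0; i < n; i++)
		put_be32(out + 4 * (i + 1), bm[i]);
	return 4 * (n + 1);
}
```

The byte counts line up with the reserve_space() calls in the patch: 16
bytes when word 2 is set, 12 when only words 0 and 1 are needed, and 8
otherwise.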


* Re: [PATCH 05/34] pnfs: ask for layout_blksize and save it in nfs_server
  2011-06-14 15:01   ` Benny Halevy
@ 2011-06-14 15:08     ` Peng Tao
  0 siblings, 0 replies; 58+ messages in thread
From: Peng Tao @ 2011-06-14 15:08 UTC (permalink / raw)
  To: Benny Halevy; +Cc: Jim Rees, linux-nfs, peter honeyman

On Tue, Jun 14, 2011 at 11:01 PM, Benny Halevy <bhalevy.lists@gmail.com> wrote:
> On 2011-06-12 19:43, Jim Rees wrote:
>> From: Peng Tao <bergwolf@gmail.com>
>
> Jim, please revise the authorship of the different patches.
> This one, for example, was originally authored by
> Fred Isaman <iisaman@citi.umich.edu> (see commit 5c5a76f)
Sorry, I kind of messed it up when squashing the patchset by hand...

Jim, please also revise the author of the following patches. They were
not written by me...
 pnfs: hook nfs_write_begin/end to allow layout driver manipulation
 pnfs: ask for layout_blksize and save it in nfs_server
 pnfs: cleanup_layoutcommit

Thanks,
Tao

>
> Benny
>
>>
>> Block layout needs it to determine IO size.
>>
>> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
>> Signed-off-by: Tao Guo <glorioustao@gmail.com>
>> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
>> Signed-off-by: Peng Tao <bergwolf@gmail.com>
>> ---
>>  fs/nfs/client.c           |    1 +
>>  fs/nfs/nfs4_fs.h          |    2 +-
>>  fs/nfs/nfs4proc.c         |    5 +-
>>  fs/nfs/nfs4xdr.c          |  101 +++++++++++++++++++++++++++++++++++++--------
>>  include/linux/nfs_fs_sb.h |    4 +-
>>  include/linux/nfs_xdr.h   |    3 +-
>>  6 files changed, 93 insertions(+), 23 deletions(-)
>>
>> diff --git a/fs/nfs/client.c b/fs/nfs/client.c
>> index 6bdb7da0..b2c6920 100644
>> --- a/fs/nfs/client.c
>> +++ b/fs/nfs/client.c
>> @@ -937,6 +937,7 @@ static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fh *mntf
>>       if (server->wsize > NFS_MAX_FILE_IO_SIZE)
>>               server->wsize = NFS_MAX_FILE_IO_SIZE;
>>       server->wpages = (server->wsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>> +     server->pnfs_blksize = fsinfo->blksize;
>>       set_pnfs_layoutdriver(server, mntfh, fsinfo->layouttype);
>>
>>       server->wtmult = nfs_block_bits(fsinfo->wtmult, NULL);
>> diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
>> index c4a6983..5725a7e 100644
>> --- a/fs/nfs/nfs4_fs.h
>> +++ b/fs/nfs/nfs4_fs.h
>> @@ -315,7 +315,7 @@ extern const struct nfs4_minor_version_ops *nfs_v4_minor_ops[];
>>  extern const u32 nfs4_fattr_bitmap[2];
>>  extern const u32 nfs4_statfs_bitmap[2];
>>  extern const u32 nfs4_pathconf_bitmap[2];
>> -extern const u32 nfs4_fsinfo_bitmap[2];
>> +extern const u32 nfs4_fsinfo_bitmap[3];
>>  extern const u32 nfs4_fs_locations_bitmap[2];
>>
>>  /* nfs4renewd.c */
>> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
>> index 4a5ad93..5246db8 100644
>> --- a/fs/nfs/nfs4proc.c
>> +++ b/fs/nfs/nfs4proc.c
>> @@ -137,12 +137,13 @@ const u32 nfs4_pathconf_bitmap[2] = {
>>       0
>>  };
>>
>> -const u32 nfs4_fsinfo_bitmap[2] = { FATTR4_WORD0_MAXFILESIZE
>> +const u32 nfs4_fsinfo_bitmap[3] = { FATTR4_WORD0_MAXFILESIZE
>>                       | FATTR4_WORD0_MAXREAD
>>                       | FATTR4_WORD0_MAXWRITE
>>                       | FATTR4_WORD0_LEASE_TIME,
>>                       FATTR4_WORD1_TIME_DELTA
>> -                     | FATTR4_WORD1_FS_LAYOUT_TYPES
>> +                     | FATTR4_WORD1_FS_LAYOUT_TYPES,
>> +                     FATTR4_WORD2_LAYOUT_BLKSIZE
>>  };
>>
>>  const u32 nfs4_fs_locations_bitmap[2] = {
>> diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
>> index 3620c45..fdcbd8f 100644
>> --- a/fs/nfs/nfs4xdr.c
>> +++ b/fs/nfs/nfs4xdr.c
>> @@ -91,7 +91,7 @@ static int nfs4_stat_to_errno(int);
>>  #define encode_getfh_maxsz      (op_encode_hdr_maxsz)
>>  #define decode_getfh_maxsz      (op_decode_hdr_maxsz + 1 + \
>>                               ((3+NFS4_FHSIZE) >> 2))
>> -#define nfs4_fattr_bitmap_maxsz 3
>> +#define nfs4_fattr_bitmap_maxsz 4
>>  #define encode_getattr_maxsz    (op_encode_hdr_maxsz + nfs4_fattr_bitmap_maxsz)
>>  #define nfs4_name_maxsz              (1 + ((3 + NFS4_MAXNAMLEN) >> 2))
>>  #define nfs4_path_maxsz              (1 + ((3 + NFS4_MAXPATHLEN) >> 2))
>> @@ -113,7 +113,11 @@ static int nfs4_stat_to_errno(int);
>>  #define encode_restorefh_maxsz  (op_encode_hdr_maxsz)
>>  #define decode_restorefh_maxsz  (op_decode_hdr_maxsz)
>>  #define encode_fsinfo_maxsz  (encode_getattr_maxsz)
>> -#define decode_fsinfo_maxsz  (op_decode_hdr_maxsz + 15)
>> +/* The 5 accounts for the PNFS attributes, and assumes that at most three
>> + * layout types will be returned.
>> + */
>> +#define decode_fsinfo_maxsz  (op_decode_hdr_maxsz + \
>> +                              nfs4_fattr_bitmap_maxsz + 4 + 8 + 5)
>>  #define encode_renew_maxsz   (op_encode_hdr_maxsz + 3)
>>  #define decode_renew_maxsz   (op_decode_hdr_maxsz)
>>  #define encode_setclientid_maxsz \
>> @@ -1095,6 +1099,35 @@ static void encode_getattr_two(struct xdr_stream *xdr, uint32_t bm0, uint32_t bm
>>       hdr->replen += decode_getattr_maxsz;
>>  }
>>
>> +static void
>> +encode_getattr_three(struct xdr_stream *xdr,
>> +                  uint32_t bm0, uint32_t bm1, uint32_t bm2,
>> +                  struct compound_hdr *hdr)
>> +{
>> +     __be32 *p;
>> +
>> +     p = reserve_space(xdr, 4);
>> +     *p = cpu_to_be32(OP_GETATTR);
>> +     if (bm2) {
>> +             p = reserve_space(xdr, 16);
>> +             *p++ = cpu_to_be32(3);
>> +             *p++ = cpu_to_be32(bm0);
>> +             *p++ = cpu_to_be32(bm1);
>> +             *p = cpu_to_be32(bm2);
>> +     } else if (bm1) {
>> +             p = reserve_space(xdr, 12);
>> +             *p++ = cpu_to_be32(2);
>> +             *p++ = cpu_to_be32(bm0);
>> +             *p = cpu_to_be32(bm1);
>> +     } else {
>> +             p = reserve_space(xdr, 8);
>> +             *p++ = cpu_to_be32(1);
>> +             *p = cpu_to_be32(bm0);
>> +     }
>> +     hdr->nops++;
>> +     hdr->replen += decode_getattr_maxsz;
>> +}
>> +
>>  static void encode_getfattr(struct xdr_stream *xdr, const u32* bitmask, struct compound_hdr *hdr)
>>  {
>>       encode_getattr_two(xdr, bitmask[0] & nfs4_fattr_bitmap[0],
>> @@ -1103,8 +1136,11 @@ static void encode_getfattr(struct xdr_stream *xdr, const u32* bitmask, struct c
>>
>>  static void encode_fsinfo(struct xdr_stream *xdr, const u32* bitmask, struct compound_hdr *hdr)
>>  {
>> -     encode_getattr_two(xdr, bitmask[0] & nfs4_fsinfo_bitmap[0],
>> -                        bitmask[1] & nfs4_fsinfo_bitmap[1], hdr);
>> +     encode_getattr_three(xdr,
>> +                          bitmask[0] & nfs4_fsinfo_bitmap[0],
>> +                          bitmask[1] & nfs4_fsinfo_bitmap[1],
>> +                          bitmask[2] & nfs4_fsinfo_bitmap[2],
>> +                          hdr);
>>  }
>>
>>  static void encode_fs_locations(struct xdr_stream *xdr, const u32* bitmask, struct compound_hdr *hdr)
>> @@ -2575,7 +2611,7 @@ static void nfs4_xdr_enc_setclientid_confirm(struct rpc_rqst *req,
>>       struct compound_hdr hdr = {
>>               .nops   = 0,
>>       };
>> -     const u32 lease_bitmap[2] = { FATTR4_WORD0_LEASE_TIME, 0 };
>> +     const u32 lease_bitmap[3] = { FATTR4_WORD0_LEASE_TIME, 0, 0 };
>>
>>       encode_compound_hdr(xdr, req, &hdr);
>>       encode_setclientid_confirm(xdr, arg, &hdr);
>> @@ -2719,7 +2755,7 @@ static void nfs4_xdr_enc_get_lease_time(struct rpc_rqst *req,
>>       struct compound_hdr hdr = {
>>               .minorversion = nfs4_xdr_minorversion(&args->la_seq_args),
>>       };
>> -     const u32 lease_bitmap[2] = { FATTR4_WORD0_LEASE_TIME, 0 };
>> +     const u32 lease_bitmap[3] = { FATTR4_WORD0_LEASE_TIME, 0, 0 };
>>
>>       encode_compound_hdr(xdr, req, &hdr);
>>       encode_sequence(xdr, &args->la_seq_args, &hdr);
>> @@ -2947,14 +2983,17 @@ static int decode_attr_bitmap(struct xdr_stream *xdr, uint32_t *bitmap)
>>               goto out_overflow;
>>       bmlen = be32_to_cpup(p);
>>
>> -     bitmap[0] = bitmap[1] = 0;
>> +     bitmap[0] = bitmap[1] = bitmap[2] = 0;
>>       p = xdr_inline_decode(xdr, (bmlen << 2));
>>       if (unlikely(!p))
>>               goto out_overflow;
>>       if (bmlen > 0) {
>>               bitmap[0] = be32_to_cpup(p++);
>> -             if (bmlen > 1)
>> -                     bitmap[1] = be32_to_cpup(p);
>> +             if (bmlen > 1) {
>> +                     bitmap[1] = be32_to_cpup(p++);
>> +                     if (bmlen > 2)
>> +                             bitmap[2] = be32_to_cpup(p);
>> +             }
>>       }
>>       return 0;
>>  out_overflow:
>> @@ -2986,8 +3025,9 @@ static int decode_attr_supported(struct xdr_stream *xdr, uint32_t *bitmap, uint3
>>                       return ret;
>>               bitmap[0] &= ~FATTR4_WORD0_SUPPORTED_ATTRS;
>>       } else
>> -             bitmask[0] = bitmask[1] = 0;
>> -     dprintk("%s: bitmask=%08x:%08x\n", __func__, bitmask[0], bitmask[1]);
>> +             bitmask[0] = bitmask[1] = bitmask[2] = 0;
>> +     dprintk("%s: bitmask=%08x:%08x:%08x\n", __func__,
>> +             bitmask[0], bitmask[1], bitmask[2]);
>>       return 0;
>>  }
>>
>> @@ -4041,7 +4081,7 @@ out_overflow:
>>  static int decode_server_caps(struct xdr_stream *xdr, struct nfs4_server_caps_res *res)
>>  {
>>       __be32 *savep;
>> -     uint32_t attrlen, bitmap[2] = {0};
>> +     uint32_t attrlen, bitmap[3] = {0};
>>       int status;
>>
>>       if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
>> @@ -4067,7 +4107,7 @@ xdr_error:
>>  static int decode_statfs(struct xdr_stream *xdr, struct nfs_fsstat *fsstat)
>>  {
>>       __be32 *savep;
>> -     uint32_t attrlen, bitmap[2] = {0};
>> +     uint32_t attrlen, bitmap[3] = {0};
>>       int status;
>>
>>       if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
>> @@ -4099,7 +4139,7 @@ xdr_error:
>>  static int decode_pathconf(struct xdr_stream *xdr, struct nfs_pathconf *pathconf)
>>  {
>>       __be32 *savep;
>> -     uint32_t attrlen, bitmap[2] = {0};
>> +     uint32_t attrlen, bitmap[3] = {0};
>>       int status;
>>
>>       if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
>> @@ -4239,7 +4279,7 @@ static int decode_getfattr_generic(struct xdr_stream *xdr, struct nfs_fattr *fat
>>  {
>>       __be32 *savep;
>>       uint32_t attrlen,
>> -              bitmap[2] = {0};
>> +              bitmap[3] = {0};
>>       int status;
>>
>>       status = decode_op_hdr(xdr, OP_GETATTR);
>> @@ -4325,10 +4365,32 @@ static int decode_attr_pnfstype(struct xdr_stream *xdr, uint32_t *bitmap,
>>       return status;
>>  }
>>
>> +/*
>> + * The preferred block size for layout directed io
>> + */
>> +static int decode_attr_layout_blksize(struct xdr_stream *xdr, uint32_t *bitmap,
>> +                                   uint32_t *res)
>> +{
>> +     __be32 *p;
>> +
>> +     dprintk("%s: bitmap is %x\n", __func__, bitmap[2]);
>> +     *res = 0;
>> +     if (bitmap[2] & FATTR4_WORD2_LAYOUT_BLKSIZE) {
>> +             p = xdr_inline_decode(xdr, 4);
>> +             if (unlikely(!p)) {
>> +                     print_overflow_msg(__func__, xdr);
>> +                     return -EIO;
>> +             }
>> +             *res = be32_to_cpup(p);
>> +             bitmap[2] &= ~FATTR4_WORD2_LAYOUT_BLKSIZE;
>> +     }
>> +     return 0;
>> +}
>> +
>>  static int decode_fsinfo(struct xdr_stream *xdr, struct nfs_fsinfo *fsinfo)
>>  {
>>       __be32 *savep;
>> -     uint32_t attrlen, bitmap[2];
>> +     uint32_t attrlen, bitmap[3];
>>       int status;
>>
>>       if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
>> @@ -4356,6 +4418,9 @@ static int decode_fsinfo(struct xdr_stream *xdr, struct nfs_fsinfo *fsinfo)
>>       status = decode_attr_pnfstype(xdr, bitmap, &fsinfo->layouttype);
>>       if (status != 0)
>>               goto xdr_error;
>> +     status = decode_attr_layout_blksize(xdr, bitmap, &fsinfo->blksize);
>> +     if (status)
>> +             goto xdr_error;
>>
>>       status = verify_attr_len(xdr, savep, attrlen);
>>  xdr_error:
>> @@ -4775,7 +4840,7 @@ static int decode_getacl(struct xdr_stream *xdr, struct rpc_rqst *req,
>>  {
>>       __be32 *savep;
>>       uint32_t attrlen,
>> -              bitmap[2] = {0};
>> +              bitmap[3] = {0};
>>       struct kvec *iov = req->rq_rcv_buf.head;
>>       int status;
>>
>> @@ -6605,7 +6670,7 @@ out:
>>  int nfs4_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry,
>>                      int plus)
>>  {
>> -     uint32_t bitmap[2] = {0};
>> +     uint32_t bitmap[3] = {0};
>>       uint32_t len;
>>       __be32 *p = xdr_inline_decode(xdr, 4);
>>       if (unlikely(!p))
>> diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
>> index 87694ca..79cc4ca 100644
>> --- a/include/linux/nfs_fs_sb.h
>> +++ b/include/linux/nfs_fs_sb.h
>> @@ -130,7 +130,7 @@ struct nfs_server {
>>  #endif
>>
>>  #ifdef CONFIG_NFS_V4
>> -     u32                     attr_bitmask[2];/* V4 bitmask representing the set
>> +     u32                     attr_bitmask[3];/* V4 bitmask representing the set
>>                                                  of attributes supported on this
>>                                                  filesystem */
>>       u32                     cache_consistency_bitmask[2];
>> @@ -143,6 +143,8 @@ struct nfs_server {
>>                                                  filesystem */
>>       struct pnfs_layoutdriver_type  *pnfs_curr_ld; /* Active layout driver */
>>       struct rpc_wait_queue   roc_rpcwaitq;
>> +     void                    *pnfs_ld_data; /* per mount point data */
>> +     u32                     pnfs_blksize; /* layout_blksize attr */
>>
>>       /* the following fields are protected by nfs_client->cl_lock */
>>       struct rb_root          state_owners;
>> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
>> index 00442f5..a9c43ba 100644
>> --- a/include/linux/nfs_xdr.h
>> +++ b/include/linux/nfs_xdr.h
>> @@ -122,6 +122,7 @@ struct nfs_fsinfo {
>>       struct timespec         time_delta; /* server time granularity */
>>       __u32                   lease_time; /* in seconds */
>>       __u32                   layouttype; /* supported pnfs layout driver */
>> +     __u32                   blksize; /* preferred pnfs io block size */
>>  };
>>
>>  struct nfs_fsstat {
>> @@ -954,7 +955,7 @@ struct nfs4_server_caps_arg {
>>  };
>>
>>  struct nfs4_server_caps_res {
>> -     u32                             attr_bitmask[2];
>> +     u32                             attr_bitmask[3];
>>       u32                             acl_bitmask;
>>       u32                             has_links;
>>       u32                             has_symlinks;


* Re: [PATCH 06/34] pnfs: cleanup_layoutcommit
  2011-06-12 23:44 ` [PATCH 06/34] pnfs: cleanup_layoutcommit Jim Rees
  2011-06-13 21:19   ` Benny Halevy
@ 2011-06-14 15:10   ` Benny Halevy
  2011-06-14 15:21     ` Peng Tao
  2011-06-14 15:19   ` Benny Halevy
  2 siblings, 1 reply; 58+ messages in thread
From: Benny Halevy @ 2011-06-14 15:10 UTC (permalink / raw)
  To: Jim Rees; +Cc: linux-nfs, peter honeyman

On 2011-06-12 19:44, Jim Rees wrote:
> From: Peng Tao <bergwolf@gmail.com>
> 
> This gives layout driver a chance to cleanup structures they put in.
> Also ensure layoutcommit does not commit more than isize, as block layout
> driver may dirty pages beyond EOF.

let's separate the latter matter into a different patch so we can
discuss the problem and the solution orthogonally to cleanup_layoutcommit.
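
For reference when the isize question is split out, the clamp under
discussion can be sketched on its own in user-space C. This is only an
illustration of the two "if" statements the patch adds to
pnfs_set_layoutcommit(); struct lseg_sketch and record_end_pos() are
hypothetical names, not kernel symbols, and locking is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-segment state the patch updates. */
struct lseg_sketch {
	int64_t pls_end_pos;	/* highest end offset recorded so far */
};

/*
 * Record the end offset of a completed write, but never let it exceed
 * the current file size (isize), since pages may be dirtied beyond EOF.
 */
static void record_end_pos(struct lseg_sketch *lseg, int64_t end_pos,
			   int64_t isize)
{
	assert(lseg != NULL);

	if (end_pos > isize)
		end_pos = isize;		/* clamp to EOF */
	if (end_pos > lseg->pls_end_pos)
		lseg->pls_end_pos = end_pos;	/* only ever grows */
}
```

With isize = 5000, a write ending at 8192 is recorded as 5000, and a
later, smaller end offset leaves the recorded value unchanged.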

> 
> Signed-off-by: Andy Adamson <andros@netapp.com>
> [fixup layout header pointer for layoutcommit]
> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
> Signed-off-by: Peng Tao <bergwolf@gmail.com>
> ---
>  fs/nfs/nfs4proc.c       |    1 +
>  fs/nfs/nfs4xdr.c        |    3 ++-
>  fs/nfs/pnfs.c           |   15 +++++++++++++++
>  fs/nfs/pnfs.h           |    4 ++++
>  include/linux/nfs_xdr.h |    1 +
>  5 files changed, 23 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index 5246db8..e27a648 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -5890,6 +5890,7 @@ static void nfs4_layoutcommit_release(void *calldata)
>  {
>  	struct nfs4_layoutcommit_data *data = calldata;
>  
> +	pnfs_cleanup_layoutcommit(data->args.inode, data);
>  	/* Matched by references in pnfs_set_layoutcommit */
>  	put_lseg(data->lseg);
>  	put_rpccred(data->cred);
> diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
> index fdcbd8f..57295d1 100644
> --- a/fs/nfs/nfs4xdr.c
> +++ b/fs/nfs/nfs4xdr.c
> @@ -1963,7 +1963,7 @@ encode_layoutcommit(struct xdr_stream *xdr,
>  	*p++ = cpu_to_be32(OP_LAYOUTCOMMIT);
>  	/* Only whole file layouts */
>  	p = xdr_encode_hyper(p, 0); /* offset */
> -	p = xdr_encode_hyper(p, NFS4_MAX_UINT64); /* length */
> +	p = xdr_encode_hyper(p, args->lastbytewritten+1); /* length */

This is unrelated to this particular patch and it should be discussed separately.
(and dropped altogether :)

>  	*p++ = cpu_to_be32(0); /* reclaim */
>  	p = xdr_encode_opaque_fixed(p, args->stateid.data, NFS4_STATEID_SIZE);
>  	*p++ = cpu_to_be32(1); /* newoffset = TRUE */
> @@ -5467,6 +5467,7 @@ static int decode_layoutcommit(struct xdr_stream *xdr,
>  	int status;
>  
>  	status = decode_op_hdr(xdr, OP_LAYOUTCOMMIT);
> +	res->status = status;

What is res->status used for?

>  	if (status)
>  		return status;
>  
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index e693718..48a06a1 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -1248,6 +1248,7 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
>  {
>  	struct nfs_inode *nfsi = NFS_I(wdata->inode);
>  	loff_t end_pos = wdata->mds_offset + wdata->res.count;
> +	loff_t isize = i_size_read(wdata->inode);
>  	bool mark_as_dirty = false;
>  
>  	spin_lock(&nfsi->vfs_inode.i_lock);
> @@ -1261,9 +1262,13 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
>  		dprintk("%s: Set layoutcommit for inode %lu ",
>  			__func__, wdata->inode->i_ino);
>  	}
> +	if (end_pos > isize)
> +		end_pos = isize;
>  	if (end_pos > wdata->lseg->pls_end_pos)
>  		wdata->lseg->pls_end_pos = end_pos;
>  	spin_unlock(&nfsi->vfs_inode.i_lock);
> +	dprintk("%s: lseg %p end_pos %llu\n",
> +		__func__, wdata->lseg, wdata->lseg->pls_end_pos);
>  
>  	/* if pnfs_layoutcommit_inode() runs between inode locks, the next one
>  	 * will be a noop because NFS_INO_LAYOUTCOMMIT will not be set */
> @@ -1272,6 +1277,16 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
>  }
>  EXPORT_SYMBOL_GPL(pnfs_set_layoutcommit);
>  
> +void pnfs_cleanup_layoutcommit(struct inode *inode,
> +                               struct nfs4_layoutcommit_data *data)
> +{
> +        struct nfs_server *nfss = NFS_SERVER(inode);
> +
> +        if (nfss->pnfs_curr_ld->cleanup_layoutcommit)
> +                nfss->pnfs_curr_ld->cleanup_layoutcommit(
> +                                        NFS_I(inode)->layout, data);
> +}
> +
>  void pnfs_free_fsdata(struct pnfs_fsdata *fsdata)
>  {
>  	/* lseg refcounting handled directly in nfs_write_end */
> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> index 525ec55..5048898 100644
> --- a/fs/nfs/pnfs.h
> +++ b/fs/nfs/pnfs.h
> @@ -127,6 +127,9 @@ struct pnfs_layoutdriver_type {
>  				     struct xdr_stream *xdr,
>  				     const struct nfs4_layoutreturn_args *args);
>  
> +        void (*cleanup_layoutcommit) (struct pnfs_layout_hdr *layoutid,
> +                                      struct nfs4_layoutcommit_data *data);
> +

nit: whitespace cleanup required...

>  	void (*encode_layoutcommit) (struct pnfs_layout_hdr *layoutid,
>  				     struct xdr_stream *xdr,
>  				     const struct nfs4_layoutcommit_args *args);
> @@ -213,6 +216,7 @@ void pnfs_roc_release(struct inode *ino);
>  void pnfs_roc_set_barrier(struct inode *ino, u32 barrier);
>  bool pnfs_roc_drain(struct inode *ino, u32 *barrier);
>  void pnfs_set_layoutcommit(struct nfs_write_data *wdata);
> +void pnfs_cleanup_layoutcommit(struct inode *inode, struct nfs4_layoutcommit_data *data);
>  int pnfs_layoutcommit_inode(struct inode *inode, bool sync);
>  int _pnfs_return_layout(struct inode *);
>  int pnfs_ld_write_done(struct nfs_write_data *);
> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
> index a9c43ba..2c3ffda 100644
> --- a/include/linux/nfs_xdr.h
> +++ b/include/linux/nfs_xdr.h
> @@ -270,6 +270,7 @@ struct nfs4_layoutcommit_res {
>  	struct nfs_fattr *fattr;
>  	const struct nfs_server *server;
>  	struct nfs4_sequence_res seq_res;
> +	int status;

This seems to be unused in this patch...

Benny

>  };
>  
>  struct nfs4_layoutcommit_data {

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 07/34] pnfsblock: define PNFS_BLOCK Kconfig option
  2011-06-12 23:44 ` [PATCH 07/34] pnfsblock: define PNFS_BLOCK Kconfig option Jim Rees
@ 2011-06-14 15:13   ` Benny Halevy
  0 siblings, 0 replies; 58+ messages in thread
From: Benny Halevy @ 2011-06-14 15:13 UTC (permalink / raw)
  To: Jim Rees; +Cc: linux-nfs, peter honeyman

On 2011-06-12 19:44, Jim Rees wrote:
> From: Fred Isaman <iisaman@citi.umich.edu>
> 
> Define a configuration variable to enable/disable compilation of the
> block driver code.
> 
> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
> [pnfs-block: fix CONFIG_PNFS_BLOCK dependencies]
> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
> ---
>  fs/nfs/Kconfig              |    8 ++++++++
>  fs/nfs/Makefile             |    1 +
>  fs/nfs/blocklayout/Makefile |    5 +++++
>  3 files changed, 14 insertions(+), 0 deletions(-)
>  create mode 100644 fs/nfs/blocklayout/Makefile
> 
> diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig
> index 8151554..3cebf1b 100644
> --- a/fs/nfs/Kconfig
> +++ b/fs/nfs/Kconfig
> @@ -97,6 +97,14 @@ config PNFS_OBJLAYOUT
>  
>  	  If unsure, say N.
>  
> +config PNFS_BLOCK
> +	tristate "Provide a pNFS block client (EXPERIMENTAL)"
> +	depends on NFS_FS && PNFS

The PNFS config option is obsolete.
This needs to depend on NFS_V4_1 instead.

Benny

> +	help
> +	  Say M or y here if you want your pNfs client to support the block protocol
> +
> +	  If unsure, say N.
> +
>  config ROOT_NFS
>  	bool "Root file system on NFS"
>  	depends on NFS_FS=y && IP_PNP
> diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
> index 6a34f7d..b58613d 100644
> --- a/fs/nfs/Makefile
> +++ b/fs/nfs/Makefile
> @@ -23,3 +23,4 @@ obj-$(CONFIG_PNFS_FILE_LAYOUT) += nfs_layout_nfsv41_files.o
>  nfs_layout_nfsv41_files-y := nfs4filelayout.o nfs4filelayoutdev.o
>  
>  obj-$(CONFIG_PNFS_OBJLAYOUT) += objlayout/
> +obj-$(CONFIG_PNFS_BLOCK) += blocklayout/
> diff --git a/fs/nfs/blocklayout/Makefile b/fs/nfs/blocklayout/Makefile
> new file mode 100644
> index 0000000..f214c1c
> --- /dev/null
> +++ b/fs/nfs/blocklayout/Makefile
> @@ -0,0 +1,5 @@
> +#
> +# Makefile for the pNFS block layout driver kernel module
> +#
> +obj-$(CONFIG_PNFS_BLOCK) +=
> +blocklayoutdriver-objs :=

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 06/34] pnfs: cleanup_layoutcommit
  2011-06-13 21:19   ` Benny Halevy
@ 2011-06-14 15:16     ` Peng Tao
  0 siblings, 0 replies; 58+ messages in thread
From: Peng Tao @ 2011-06-14 15:16 UTC (permalink / raw)
  To: Benny Halevy; +Cc: Jim Rees, linux-nfs, peter honeyman

On Tue, Jun 14, 2011 at 5:19 AM, Benny Halevy <bhalevy.lists@gmail.com> wrote:
> On 2011-06-12 19:44, Jim Rees wrote:
>> From: Peng Tao <bergwolf@gmail.com>
>>
>> This gives layout driver a chance to cleanup structures they put in.
>> Also ensure layoutcommit does not commit more than isize, as block layout
>> driver may dirty pages beyond EOF.
>>
>> Signed-off-by: Andy Adamson <andros@netapp.com>
>> [fixup layout header pointer for layoutcommit]
>> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
>> Signed-off-by: Peng Tao <bergwolf@gmail.com>
>> ---
>>  fs/nfs/nfs4proc.c       |    1 +
>>  fs/nfs/nfs4xdr.c        |    3 ++-
>>  fs/nfs/pnfs.c           |   15 +++++++++++++++
>>  fs/nfs/pnfs.h           |    4 ++++
>>  include/linux/nfs_xdr.h |    1 +
>>  5 files changed, 23 insertions(+), 1 deletions(-)
>>
>> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
>> index 5246db8..e27a648 100644
>> --- a/fs/nfs/nfs4proc.c
>> +++ b/fs/nfs/nfs4proc.c
>> @@ -5890,6 +5890,7 @@ static void nfs4_layoutcommit_release(void *calldata)
>>  {
>>       struct nfs4_layoutcommit_data *data = calldata;
>>
>> +     pnfs_cleanup_layoutcommit(data->args.inode, data);
>
> The layout driver better be passed the status on the done method
> rather than on release so that it can roll back on error.
>
> Although it is quite complicated to roll back after permanent errors like
> NFS4ERR_BADLAYOUT where the client is really screwed and it
> essentially needs to redirty and rewrite the data (to the MDS
> to simplify the error handling path), rolling back from
> transient errors like NFS4ERR_DELAY should be fairly easy.
I agree that it can be moved to layoutcommit_done. But how is that related
to rolling back in the error case? IMHO, layoutcommit error handling
should be implemented in generic code; e.g., for NFS4ERR_DELAY, the
current code will retry the layoutcommit in the generic layer.
pnfs_cleanup_layoutcommit is simply an interface that lets the layout driver
clean up the private data associated with this layoutcommit operation.
For the block layout specifically, that means cleaning up the committing
extent list.
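
The split described above (generic code owns the error handling, the
driver hook only releases private state) can be sketched in user-space C
as follows. All names ending in _sketch, and block_cleanup(), are
hypothetical; only the shape of the optional callback follows
pnfs_cleanup_layoutcommit() in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-layoutcommit data carrying driver-private state. */
struct commit_data_sketch {
	int cleaned;	/* set once the driver has released its state */
};

/* Hypothetical layout driver ops table; the hook is optional. */
struct layoutdriver_sketch {
	void (*cleanup_layoutcommit)(struct commit_data_sketch *data);
};

/* Generic layer: call the driver hook only if the driver provides one. */
static void generic_cleanup_layoutcommit(const struct layoutdriver_sketch *ld,
					 struct commit_data_sketch *data)
{
	assert(ld != NULL);

	if (ld->cleanup_layoutcommit)
		ld->cleanup_layoutcommit(data);
}

/* Example driver hook: just mark the private state as released. */
static void block_cleanup(struct commit_data_sketch *data)
{
	data->cleaned = 1;
}
```

A driver that leaves the pointer NULL is simply skipped, so layout
drivers with nothing to release need no changes.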

Thanks,
Tao

>
> Benny
>
>>       /* Matched by references in pnfs_set_layoutcommit */
>>       put_lseg(data->lseg);
>>       put_rpccred(data->cred);
>> diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
>> index fdcbd8f..57295d1 100644
>> --- a/fs/nfs/nfs4xdr.c
>> +++ b/fs/nfs/nfs4xdr.c
>> @@ -1963,7 +1963,7 @@ encode_layoutcommit(struct xdr_stream *xdr,
>>       *p++ = cpu_to_be32(OP_LAYOUTCOMMIT);
>>       /* Only whole file layouts */
>>       p = xdr_encode_hyper(p, 0); /* offset */
>> -     p = xdr_encode_hyper(p, NFS4_MAX_UINT64); /* length */
>> +     p = xdr_encode_hyper(p, args->lastbytewritten+1); /* length */
>>       *p++ = cpu_to_be32(0); /* reclaim */
>>       p = xdr_encode_opaque_fixed(p, args->stateid.data, NFS4_STATEID_SIZE);
>>       *p++ = cpu_to_be32(1); /* newoffset = TRUE */
>> @@ -5467,6 +5467,7 @@ static int decode_layoutcommit(struct xdr_stream *xdr,
>>       int status;
>>
>>       status = decode_op_hdr(xdr, OP_LAYOUTCOMMIT);
>> +     res->status = status;
>>       if (status)
>>               return status;
>>
>> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>> index e693718..48a06a1 100644
>> --- a/fs/nfs/pnfs.c
>> +++ b/fs/nfs/pnfs.c
>> @@ -1248,6 +1248,7 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
>>  {
>>       struct nfs_inode *nfsi = NFS_I(wdata->inode);
>>       loff_t end_pos = wdata->mds_offset + wdata->res.count;
>> +     loff_t isize = i_size_read(wdata->inode);
>>       bool mark_as_dirty = false;
>>
>>       spin_lock(&nfsi->vfs_inode.i_lock);
>> @@ -1261,9 +1262,13 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
>>               dprintk("%s: Set layoutcommit for inode %lu ",
>>                       __func__, wdata->inode->i_ino);
>>       }
>> +     if (end_pos > isize)
>> +             end_pos = isize;
>>       if (end_pos > wdata->lseg->pls_end_pos)
>>               wdata->lseg->pls_end_pos = end_pos;
>>       spin_unlock(&nfsi->vfs_inode.i_lock);
>> +     dprintk("%s: lseg %p end_pos %llu\n",
>> +             __func__, wdata->lseg, wdata->lseg->pls_end_pos);
>>
>>       /* if pnfs_layoutcommit_inode() runs between inode locks, the next one
>>        * will be a noop because NFS_INO_LAYOUTCOMMIT will not be set */
>> @@ -1272,6 +1277,16 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
>>  }
>>  EXPORT_SYMBOL_GPL(pnfs_set_layoutcommit);
>>
>> +void pnfs_cleanup_layoutcommit(struct inode *inode,
>> +                               struct nfs4_layoutcommit_data *data)
>> +{
>> +        struct nfs_server *nfss = NFS_SERVER(inode);
>> +
>> +        if (nfss->pnfs_curr_ld->cleanup_layoutcommit)
>> +                nfss->pnfs_curr_ld->cleanup_layoutcommit(
>> +                                        NFS_I(inode)->layout, data);
>> +}
>> +
>>  void pnfs_free_fsdata(struct pnfs_fsdata *fsdata)
>>  {
>>       /* lseg refcounting handled directly in nfs_write_end */
>> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
>> index 525ec55..5048898 100644
>> --- a/fs/nfs/pnfs.h
>> +++ b/fs/nfs/pnfs.h
>> @@ -127,6 +127,9 @@ struct pnfs_layoutdriver_type {
>>                                    struct xdr_stream *xdr,
>>                                    const struct nfs4_layoutreturn_args *args);
>>
>> +        void (*cleanup_layoutcommit) (struct pnfs_layout_hdr *layoutid,
>> +                                      struct nfs4_layoutcommit_data *data);
>> +
>>       void (*encode_layoutcommit) (struct pnfs_layout_hdr *layoutid,
>>                                    struct xdr_stream *xdr,
>>                                    const struct nfs4_layoutcommit_args *args);
>> @@ -213,6 +216,7 @@ void pnfs_roc_release(struct inode *ino);
>>  void pnfs_roc_set_barrier(struct inode *ino, u32 barrier);
>>  bool pnfs_roc_drain(struct inode *ino, u32 *barrier);
>>  void pnfs_set_layoutcommit(struct nfs_write_data *wdata);
>> +void pnfs_cleanup_layoutcommit(struct inode *inode, struct nfs4_layoutcommit_data *data);
>>  int pnfs_layoutcommit_inode(struct inode *inode, bool sync);
>>  int _pnfs_return_layout(struct inode *);
>>  int pnfs_ld_write_done(struct nfs_write_data *);
>> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
>> index a9c43ba..2c3ffda 100644
>> --- a/include/linux/nfs_xdr.h
>> +++ b/include/linux/nfs_xdr.h
>> @@ -270,6 +270,7 @@ struct nfs4_layoutcommit_res {
>>       struct nfs_fattr *fattr;
>>       const struct nfs_server *server;
>>       struct nfs4_sequence_res seq_res;
>> +     int status;
>>  };
>>
>>  struct nfs4_layoutcommit_data {

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 06/34] pnfs: cleanup_layoutcommit
  2011-06-12 23:44 ` [PATCH 06/34] pnfs: cleanup_layoutcommit Jim Rees
  2011-06-13 21:19   ` Benny Halevy
  2011-06-14 15:10   ` Benny Halevy
@ 2011-06-14 15:19   ` Benny Halevy
  2 siblings, 0 replies; 58+ messages in thread
From: Benny Halevy @ 2011-06-14 15:19 UTC (permalink / raw)
  To: Jim Rees; +Cc: linux-nfs, peter honeyman

On 2011-06-12 19:44, Jim Rees wrote:
> From: Peng Tao <bergwolf@gmail.com>
> 
> This gives layout driver a chance to cleanup structures they put in.
> Also ensure layoutcommit does not commit more than isize, as block layout
> driver may dirty pages beyond EOF.
> 
> Signed-off-by: Andy Adamson <andros@netapp.com>
> [fixup layout header pointer for layoutcommit]
> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
> Signed-off-by: Peng Tao <bergwolf@gmail.com>
> ---
>  fs/nfs/nfs4proc.c       |    1 +
>  fs/nfs/nfs4xdr.c        |    3 ++-
>  fs/nfs/pnfs.c           |   15 +++++++++++++++
>  fs/nfs/pnfs.h           |    4 ++++
>  include/linux/nfs_xdr.h |    1 +
>  5 files changed, 23 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index 5246db8..e27a648 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -5890,6 +5890,7 @@ static void nfs4_layoutcommit_release(void *calldata)
>  {
>  	struct nfs4_layoutcommit_data *data = calldata;
>  
> +	pnfs_cleanup_layoutcommit(data->args.inode, data);

One more issue we discussed verbally: this had better move to
nfs4_layoutcommit_done and pass the status to the layout driver, so that
it can roll forward on success or recover (roll back / shout / panic :)
on error.
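
As a rough illustration of what passing the status could enable, here is
a user-space C sketch of a status classifier. It is an assumption about
how a driver might react, not code from the patch set: the enum names
and classify_layoutcommit_status() are hypothetical, and the error
values are the positive NFSv4.1 protocol numbers (the kernel normally
passes them around negated).

```c
#include <assert.h>

/* Protocol status values used by this sketch (positive wire values). */
enum {
	SKETCH_OK		= 0,
	SKETCH_ERR_DELAY	= 10008,	/* NFS4ERR_DELAY, transient */
	SKETCH_ERR_BADLAYOUT	= 10050		/* NFS4ERR_BADLAYOUT, permanent */
};

/* What a done-time hook might decide to do with the result. */
enum commit_action {
	ROLL_FORWARD,	/* success: release the committed state */
	RETRY_LATER,	/* transient error: keep state, try again */
	RECOVER		/* permanent error: roll back or rewrite the data */
};

static enum commit_action classify_layoutcommit_status(int status)
{
	assert(status >= 0);	/* sketch takes wire values, not -errno */

	switch (status) {
	case SKETCH_OK:
		return ROLL_FORWARD;
	case SKETCH_ERR_DELAY:
		return RETRY_LATER;
	default:
		return RECOVER;
	}
}
```

Treating every unrecognized error as RECOVER is the conservative choice
in this sketch; a real implementation would likely distinguish more
cases.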

Benny

>  	/* Matched by references in pnfs_set_layoutcommit */
>  	put_lseg(data->lseg);
>  	put_rpccred(data->cred);
> diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
> index fdcbd8f..57295d1 100644
> --- a/fs/nfs/nfs4xdr.c
> +++ b/fs/nfs/nfs4xdr.c
> @@ -1963,7 +1963,7 @@ encode_layoutcommit(struct xdr_stream *xdr,
>  	*p++ = cpu_to_be32(OP_LAYOUTCOMMIT);
>  	/* Only whole file layouts */
>  	p = xdr_encode_hyper(p, 0); /* offset */
> -	p = xdr_encode_hyper(p, NFS4_MAX_UINT64); /* length */
> +	p = xdr_encode_hyper(p, args->lastbytewritten+1); /* length */
>  	*p++ = cpu_to_be32(0); /* reclaim */
>  	p = xdr_encode_opaque_fixed(p, args->stateid.data, NFS4_STATEID_SIZE);
>  	*p++ = cpu_to_be32(1); /* newoffset = TRUE */
> @@ -5467,6 +5467,7 @@ static int decode_layoutcommit(struct xdr_stream *xdr,
>  	int status;
>  
>  	status = decode_op_hdr(xdr, OP_LAYOUTCOMMIT);
> +	res->status = status;
>  	if (status)
>  		return status;
>  
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index e693718..48a06a1 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -1248,6 +1248,7 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
>  {
>  	struct nfs_inode *nfsi = NFS_I(wdata->inode);
>  	loff_t end_pos = wdata->mds_offset + wdata->res.count;
> +	loff_t isize = i_size_read(wdata->inode);
>  	bool mark_as_dirty = false;
>  
>  	spin_lock(&nfsi->vfs_inode.i_lock);
> @@ -1261,9 +1262,13 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
>  		dprintk("%s: Set layoutcommit for inode %lu ",
>  			__func__, wdata->inode->i_ino);
>  	}
> +	if (end_pos > isize)
> +		end_pos = isize;
>  	if (end_pos > wdata->lseg->pls_end_pos)
>  		wdata->lseg->pls_end_pos = end_pos;
>  	spin_unlock(&nfsi->vfs_inode.i_lock);
> +	dprintk("%s: lseg %p end_pos %llu\n",
> +		__func__, wdata->lseg, wdata->lseg->pls_end_pos);
>  
>  	/* if pnfs_layoutcommit_inode() runs between inode locks, the next one
>  	 * will be a noop because NFS_INO_LAYOUTCOMMIT will not be set */
> @@ -1272,6 +1277,16 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
>  }
>  EXPORT_SYMBOL_GPL(pnfs_set_layoutcommit);
>  
> +void pnfs_cleanup_layoutcommit(struct inode *inode,
> +                               struct nfs4_layoutcommit_data *data)
> +{
> +        struct nfs_server *nfss = NFS_SERVER(inode);
> +
> +        if (nfss->pnfs_curr_ld->cleanup_layoutcommit)
> +                nfss->pnfs_curr_ld->cleanup_layoutcommit(
> +                                        NFS_I(inode)->layout, data);
> +}
> +
>  void pnfs_free_fsdata(struct pnfs_fsdata *fsdata)
>  {
>  	/* lseg refcounting handled directly in nfs_write_end */
> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> index 525ec55..5048898 100644
> --- a/fs/nfs/pnfs.h
> +++ b/fs/nfs/pnfs.h
> @@ -127,6 +127,9 @@ struct pnfs_layoutdriver_type {
>  				     struct xdr_stream *xdr,
>  				     const struct nfs4_layoutreturn_args *args);
>  
> +        void (*cleanup_layoutcommit) (struct pnfs_layout_hdr *layoutid,
> +                                      struct nfs4_layoutcommit_data *data);
> +
>  	void (*encode_layoutcommit) (struct pnfs_layout_hdr *layoutid,
>  				     struct xdr_stream *xdr,
>  				     const struct nfs4_layoutcommit_args *args);
> @@ -213,6 +216,7 @@ void pnfs_roc_release(struct inode *ino);
>  void pnfs_roc_set_barrier(struct inode *ino, u32 barrier);
>  bool pnfs_roc_drain(struct inode *ino, u32 *barrier);
>  void pnfs_set_layoutcommit(struct nfs_write_data *wdata);
> +void pnfs_cleanup_layoutcommit(struct inode *inode, struct nfs4_layoutcommit_data *data);
>  int pnfs_layoutcommit_inode(struct inode *inode, bool sync);
>  int _pnfs_return_layout(struct inode *);
>  int pnfs_ld_write_done(struct nfs_write_data *);
> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
> index a9c43ba..2c3ffda 100644
> --- a/include/linux/nfs_xdr.h
> +++ b/include/linux/nfs_xdr.h
> @@ -270,6 +270,7 @@ struct nfs4_layoutcommit_res {
>  	struct nfs_fattr *fattr;
>  	const struct nfs_server *server;
>  	struct nfs4_sequence_res seq_res;
> +	int status;
>  };
>  
>  struct nfs4_layoutcommit_data {

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 06/34] pnfs: cleanup_layoutcommit
  2011-06-14 15:10   ` Benny Halevy
@ 2011-06-14 15:21     ` Peng Tao
  0 siblings, 0 replies; 58+ messages in thread
From: Peng Tao @ 2011-06-14 15:21 UTC (permalink / raw)
  To: Benny Halevy; +Cc: Jim Rees, linux-nfs, peter honeyman

On Tue, Jun 14, 2011 at 11:10 PM, Benny Halevy <bhalevy.lists@gmail.com> wrote:
> On 2011-06-12 19:44, Jim Rees wrote:
>> From: Peng Tao <bergwolf@gmail.com>
>>
>> This gives layout driver a chance to cleanup structures they put in.
>> Also ensure layoutcommit does not commit more than isize, as block layout
>> driver may dirty pages beyond EOF.
>
> let's separate the latter matter into a different patch so we can
> discuss the problem and the solution orthogonally to cleanup_layoutcommit.
>
>>
>> Signed-off-by: Andy Adamson <andros@netapp.com>
>> [fixup layout header pointer for layoutcommit]
>> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
>> Signed-off-by: Peng Tao <bergwolf@gmail.com>
>> ---
>>  fs/nfs/nfs4proc.c       |    1 +
>>  fs/nfs/nfs4xdr.c        |    3 ++-
>>  fs/nfs/pnfs.c           |   15 +++++++++++++++
>>  fs/nfs/pnfs.h           |    4 ++++
>>  include/linux/nfs_xdr.h |    1 +
>>  5 files changed, 23 insertions(+), 1 deletions(-)
>>
>> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
>> index 5246db8..e27a648 100644
>> --- a/fs/nfs/nfs4proc.c
>> +++ b/fs/nfs/nfs4proc.c
>> @@ -5890,6 +5890,7 @@ static void nfs4_layoutcommit_release(void *calldata)
>>  {
>>       struct nfs4_layoutcommit_data *data = calldata;
>>
>> +     pnfs_cleanup_layoutcommit(data->args.inode, data);
>>       /* Matched by references in pnfs_set_layoutcommit */
>>       put_lseg(data->lseg);
>>       put_rpccred(data->cred);
>> diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
>> index fdcbd8f..57295d1 100644
>> --- a/fs/nfs/nfs4xdr.c
>> +++ b/fs/nfs/nfs4xdr.c
>> @@ -1963,7 +1963,7 @@ encode_layoutcommit(struct xdr_stream *xdr,
>>       *p++ = cpu_to_be32(OP_LAYOUTCOMMIT);
>>       /* Only whole file layouts */
>>       p = xdr_encode_hyper(p, 0); /* offset */
>> -     p = xdr_encode_hyper(p, NFS4_MAX_UINT64); /* length */
>> +     p = xdr_encode_hyper(p, args->lastbytewritten+1); /* length */
>
> This is unrelated to this particular patch and it should be discussed separately.
> (and dropped altogether :)
>
>>       *p++ = cpu_to_be32(0); /* reclaim */
>>       p = xdr_encode_opaque_fixed(p, args->stateid.data, NFS4_STATEID_SIZE);
>>       *p++ = cpu_to_be32(1); /* newoffset = TRUE */
>> @@ -5467,6 +5467,7 @@ static int decode_layoutcommit(struct xdr_stream *xdr,
>>       int status;
>>
>>       status = decode_op_hdr(xdr, OP_LAYOUTCOMMIT);
>> +     res->status = status;
>
> What is res->status used for?
The block layout driver uses it to decide whether to clean up the
committing extent list or put those extents back on the dirty extent list
(which is probably wrong, but that remains to be seen once layoutcommit
error handling is implemented in the generic layer).
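
A minimal user-space C sketch of that decision, assuming two singly
linked lists; extent_sketch, extent_lists_sketch and finish_commit() are
hypothetical names, and the real driver's extent structures and locking
are not shown.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical extent record; real extents carry offsets and state. */
struct extent_sketch {
	struct extent_sketch *next;
};

struct extent_lists_sketch {
	struct extent_sketch *dirty;		/* waiting for a layoutcommit */
	struct extent_sketch *committing;	/* sent in the current one */
};

/*
 * On success (status == 0) the committing extents are done and freed;
 * on any error they are pushed back onto the dirty list so that a
 * later layoutcommit can pick them up again.
 */
static void finish_commit(struct extent_lists_sketch *l, int status)
{
	assert(l != NULL);

	while (l->committing) {
		struct extent_sketch *e = l->committing;

		l->committing = e->next;
		if (status == 0) {
			free(e);
		} else {
			e->next = l->dirty;
			l->dirty = e;
		}
	}
}
```

Either way the committing list ends up empty; only the destination of
its entries depends on the status.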

>
>>       if (status)
>>               return status;
>>
>> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>> index e693718..48a06a1 100644
>> --- a/fs/nfs/pnfs.c
>> +++ b/fs/nfs/pnfs.c
>> @@ -1248,6 +1248,7 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
>>  {
>>       struct nfs_inode *nfsi = NFS_I(wdata->inode);
>>       loff_t end_pos = wdata->mds_offset + wdata->res.count;
>> +     loff_t isize = i_size_read(wdata->inode);
>>       bool mark_as_dirty = false;
>>
>>       spin_lock(&nfsi->vfs_inode.i_lock);
>> @@ -1261,9 +1262,13 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
>>               dprintk("%s: Set layoutcommit for inode %lu ",
>>                       __func__, wdata->inode->i_ino);
>>       }
>> +     if (end_pos > isize)
>> +             end_pos = isize;
>>       if (end_pos > wdata->lseg->pls_end_pos)
>>               wdata->lseg->pls_end_pos = end_pos;
>>       spin_unlock(&nfsi->vfs_inode.i_lock);
>> +     dprintk("%s: lseg %p end_pos %llu\n",
>> +             __func__, wdata->lseg, wdata->lseg->pls_end_pos);
>>
>>       /* if pnfs_layoutcommit_inode() runs between inode locks, the next one
>>        * will be a noop because NFS_INO_LAYOUTCOMMIT will not be set */
>> @@ -1272,6 +1277,16 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
>>  }
>>  EXPORT_SYMBOL_GPL(pnfs_set_layoutcommit);
>>
>> +void pnfs_cleanup_layoutcommit(struct inode *inode,
>> +                               struct nfs4_layoutcommit_data *data)
>> +{
>> +        struct nfs_server *nfss = NFS_SERVER(inode);
>> +
>> +        if (nfss->pnfs_curr_ld->cleanup_layoutcommit)
>> +                nfss->pnfs_curr_ld->cleanup_layoutcommit(
>> +                                        NFS_I(inode)->layout, data);
>> +}
>> +
>>  void pnfs_free_fsdata(struct pnfs_fsdata *fsdata)
>>  {
>>       /* lseg refcounting handled directly in nfs_write_end */
>> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
>> index 525ec55..5048898 100644
>> --- a/fs/nfs/pnfs.h
>> +++ b/fs/nfs/pnfs.h
>> @@ -127,6 +127,9 @@ struct pnfs_layoutdriver_type {
>>                                    struct xdr_stream *xdr,
>>                                    const struct nfs4_layoutreturn_args *args);
>>
>> +        void (*cleanup_layoutcommit) (struct pnfs_layout_hdr *layoutid,
>> +                                      struct nfs4_layoutcommit_data *data);
>> +
>
> nit: whitespace cleanup required...
>
>>       void (*encode_layoutcommit) (struct pnfs_layout_hdr *layoutid,
>>                                    struct xdr_stream *xdr,
>>                                    const struct nfs4_layoutcommit_args *args);
>> @@ -213,6 +216,7 @@ void pnfs_roc_release(struct inode *ino);
>>  void pnfs_roc_set_barrier(struct inode *ino, u32 barrier);
>>  bool pnfs_roc_drain(struct inode *ino, u32 *barrier);
>>  void pnfs_set_layoutcommit(struct nfs_write_data *wdata);
>> +void pnfs_cleanup_layoutcommit(struct inode *inode, struct nfs4_layoutcommit_data *data);
>>  int pnfs_layoutcommit_inode(struct inode *inode, bool sync);
>>  int _pnfs_return_layout(struct inode *);
>>  int pnfs_ld_write_done(struct nfs_write_data *);
>> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
>> index a9c43ba..2c3ffda 100644
>> --- a/include/linux/nfs_xdr.h
>> +++ b/include/linux/nfs_xdr.h
>> @@ -270,6 +270,7 @@ struct nfs4_layoutcommit_res {
>>       struct nfs_fattr *fattr;
>>       const struct nfs_server *server;
>>       struct nfs4_sequence_res seq_res;
>> +     int status;
>
> This seems to be unused in this patch...
>
> Benny
>
>>  };
>>
>>  struct nfs4_layoutcommit_data {



-- 
Thanks,
-Bergwolf

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 17/34] pnfsblock: call and parse getdevicelist
  2011-06-12 23:44 ` [PATCH 17/34] pnfsblock: call and parse getdevicelist Jim Rees
@ 2011-06-14 15:36   ` Benny Halevy
  0 siblings, 0 replies; 58+ messages in thread
From: Benny Halevy @ 2011-06-14 15:36 UTC (permalink / raw)
  To: Jim Rees; +Cc: linux-nfs, peter honeyman

On 2011-06-12 19:44, Jim Rees wrote:
> From: Fred Isaman <iisaman@citi.umich.edu>
> 
> Call GETDEVICELIST during mount, then call and parse GETDEVICEINFO
> for each device returned.
> 
> [pnfsblock: fix pnfs_deviceid references]
> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
> [pnfsblock: fix print format warnings for sector_t and size_t]
> [pnfs-block: #include <linux/vmalloc.h>]
> [pnfsblock: no PNFS_NFS_SERVER]
> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
> [pnfsblock: fix bug determining size of striped volume]
> [pnfsblock: fix oops when using multiple devices]
> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
> ---
>  fs/nfs/blocklayout/blocklayout.c |  155 +++++++++++++++++++++++++++++++++++++-
>  fs/nfs/blocklayout/blocklayout.h |   95 +++++++++++++++++++++++
>  2 files changed, 248 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
> index 88b9d1a..36374f4 100644
> --- a/fs/nfs/blocklayout/blocklayout.c
> +++ b/fs/nfs/blocklayout/blocklayout.c
> @@ -31,7 +31,7 @@
>   */
>  #include <linux/module.h>
>  #include <linux/init.h>
> -
> +#include <linux/vmalloc.h>
>  #include "blocklayout.h"
>  
>  #define NFSDBG_FACILITY         NFSDBG_PNFS_LD
> @@ -164,17 +164,168 @@ bl_cleanup_layoutcommit(struct pnfs_layout_hdr *lo,
>  {
>  }
>  
> +static void free_blk_mountid(struct block_mount_id *mid)
> +{
> +	if (mid) {
> +		struct pnfs_block_dev *dev;
> +		spin_lock(&mid->bm_lock);
> +		while (!list_empty(&mid->bm_devlist)) {
> +			dev = list_first_entry(&mid->bm_devlist,
> +					       struct pnfs_block_dev,
> +					       bm_node);
> +			list_del(&dev->bm_node);
> +			free_block_dev(dev);
> +		}
> +		spin_unlock(&mid->bm_lock);
> +		kfree(mid);
> +	}
> +}
> +
> +/* This is mostly copied from the filelayout's get_device_info function.
> + * It seems much of this should be at the generic pnfs level.
> + */
> +static struct pnfs_block_dev *
> +nfs4_blk_get_deviceinfo(struct nfs_server *server, const struct nfs_fh *fh,
> +			struct nfs4_deviceid *d_id,
> +			struct list_head *sdlist)
> +{
> +	struct pnfs_device *dev;
> +	struct pnfs_block_dev *rv = NULL;
> +	u32 max_resp_sz;
> +	int max_pages;
> +	struct page **pages = NULL;
> +	int i, rc;
> +
> +	/*
> +	 * Use the session max response size as the basis for setting
> +	 * GETDEVICEINFO's maxcount
> +	 */
> +	max_resp_sz = server->nfs_client->cl_session->fc_attrs.max_resp_sz;
> +	max_pages = max_resp_sz >> PAGE_SHIFT;
> +	dprintk("%s max_resp_sz %u max_pages %d\n",
> +		__func__, max_resp_sz, max_pages);
> +
> +	dev = kmalloc(sizeof(*dev), GFP_KERNEL);
> +	if (!dev) {
> +		dprintk("%s kmalloc failed\n", __func__);
> +		return NULL;
> +	}
> +
> +	pages = kzalloc(max_pages * sizeof(struct page *), GFP_KERNEL);
> +	if (pages == NULL) {
> +		kfree(dev);
> +		return NULL;
> +	}
> +	for (i = 0; i < max_pages; i++) {
> +		pages[i] = alloc_page(GFP_KERNEL);
> +		if (!pages[i])
> +			goto out_free;
> +	}
> +
> +	/* set dev->area */
> +	dev->area = vmap(pages, max_pages, VM_MAP, PAGE_KERNEL);
> +	if (!dev->area)
> +		goto out_free;
> +
> +	memcpy(&dev->dev_id, d_id, sizeof(*d_id));
> +	dev->layout_type = LAYOUT_BLOCK_VOLUME;
> +	dev->pages = pages;
> +	dev->pgbase = 0;
> +	dev->pglen = PAGE_SIZE * max_pages;
> +	dev->mincount = 0;
> +
> +	dprintk("%s: dev_id: %s\n", __func__, dev->dev_id.data);
> +	rc = nfs4_proc_getdeviceinfo(server, dev);
> +	dprintk("%s getdevice info returns %d\n", __func__, rc);
> +	if (rc)
> +		goto out_free;
> +
> +	rv = nfs4_blk_decode_device(server, dev, sdlist);
> + out_free:
> +	if (dev->area != NULL)
> +		vunmap(dev->area);
> +	for (i = 0; i < max_pages; i++)
> +		__free_page(pages[i]);
> +	kfree(pages);
> +	kfree(dev);
> +	return rv;
> +}
> +
>  static int
>  bl_set_layoutdriver(struct nfs_server *server, const struct nfs_fh *fh)
>  {
> +	struct block_mount_id *b_mt_id = NULL;
> +	struct pnfs_mount_type *mtype = NULL;
> +	struct pnfs_devicelist *dlist = NULL;
> +	struct pnfs_block_dev *bdev;
> +	LIST_HEAD(block_disklist);
> +	int status = 0, i;
> +
>  	dprintk("%s enter\n", __func__);
> -	return 0;
> +
> +	if (server->pnfs_blksize == 0) {
> +		dprintk("%s Server did not return blksize\n", __func__);
> +		return -EINVAL;
> +	}
> +	b_mt_id = kzalloc(sizeof(struct block_mount_id), GFP_KERNEL);
> +	if (!b_mt_id) {
> +		status = -ENOMEM;
> +		goto out_error;
> +	}
> +	/* Initialize nfs4 block layout mount id */
> +	spin_lock_init(&b_mt_id->bm_lock);
> +	INIT_LIST_HEAD(&b_mt_id->bm_devlist);
> +
> +	dlist = kmalloc(sizeof(struct pnfs_devicelist), GFP_KERNEL);
> +	if (!dlist)
> +		goto out_error;
> +	dlist->eof = 0;
> +	while (!dlist->eof) {
> +		status = nfs4_proc_getdevicelist(server, fh, dlist);
> +		if (status)
> +			goto out_error;
> +		dprintk("%s GETDEVICELIST numdevs=%i, eof=%i\n",
> +			__func__, dlist->num_devs, dlist->eof);
> +		/* For each device returned in dlist, call GETDEVICEINFO, and
> +		 * decode the opaque topology encoding to create a flat
> +		 * volume topology, matching VOLUME_SIMPLE disk signatures
> +		 * to disks in the visible block disk list.
> +		 * Construct an LVM meta device from the flat volume topology.
> +		 */
> +		for (i = 0; i < dlist->num_devs; i++) {
> +			bdev = nfs4_blk_get_deviceinfo(server, fh,
> +						     &dlist->dev_id[i],
> +						     &block_disklist);
> +			if (!bdev) {
> +				status = -ENODEV;
> +				goto out_error;
> +			}
> +			spin_lock(&b_mt_id->bm_lock);
> +			list_add(&bdev->bm_node, &b_mt_id->bm_devlist);
> +			spin_unlock(&b_mt_id->bm_lock);
> +		}
> +	}
> +	dprintk("%s SUCCESS\n", __func__);
> +	server->pnfs_ld_data = b_mt_id;
> +
> + out_return:
> +	kfree(dlist);
> +	return status;
> +
> + out_error:
> +	free_blk_mountid(b_mt_id);
> +	kfree(mtype);
> +	goto out_return;
>  }
>  
>  static int
>  bl_clear_layoutdriver(struct nfs_server *server)
>  {
> +	struct block_mount_id *b_mt_id = server->pnfs_ld_data;
> +
>  	dprintk("%s enter\n", __func__);
> +	free_blk_mountid(b_mt_id);
> +	dprintk("%s RETURNS\n", __func__);
>  	return 0;
>  }
>  
> diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
> index 6bbfc3d..21fa21c 100644
> --- a/fs/nfs/blocklayout/blocklayout.h
> +++ b/fs/nfs/blocklayout/blocklayout.h
> @@ -35,12 +35,60 @@
>  #include <linux/nfs_fs.h>
>  #include "../pnfs.h"
>  
> +struct block_mount_id {
> +	spinlock_t			bm_lock;    /* protects list */
> +	struct list_head		bm_devlist; /* holds pnfs_block_dev */
> +};
> +
>  struct pnfs_block_dev {
>  	struct list_head		bm_node;
>  	struct nfs4_deviceid		bm_mdevid;    /* associated devid */
>  	struct block_device		*bm_mdev;     /* meta device itself */
>  };
>  
> +/* holds visible disks that can be matched against VOLUME_SIMPLE signatures */
> +struct visible_block_device {
> +	struct list_head	vi_node;
> +	struct block_device	*vi_bdev;
> +	int			vi_mapped;
> +	int			vi_put_done;
> +};
> +
> +enum blk_vol_type {
> +	PNFS_BLOCK_VOLUME_SIMPLE   = 0,	/* maps to a single LU */
> +	PNFS_BLOCK_VOLUME_SLICE    = 1,	/* slice of another volume */
> +	PNFS_BLOCK_VOLUME_CONCAT   = 2,	/* concatenation of multiple volumes */
> +	PNFS_BLOCK_VOLUME_STRIPE   = 3	/* striped across multiple volumes */
> +};
> +
> +/* All disk offset/lengths are stored in 512-byte sectors */
> +struct pnfs_blk_volume {
> +	uint32_t		bv_type;
> +	sector_t 		bv_size;
> +	struct pnfs_blk_volume 	**bv_vols;
> +	int 			bv_vol_n;
> +	union {
> +		dev_t			bv_dev;
> +		sector_t		bv_stripe_unit;
> +		sector_t 		bv_offset;
> +	};
> +};
> +
> +/* Since components need not be aligned, cannot use sector_t */
> +struct pnfs_blk_sig_comp {
> +	int64_t 	bs_offset;  /* In bytes */
> +	uint32_t   	bs_length;  /* In bytes */
> +	char 		*bs_string;
> +};
> +
> +/* Maximum number of signatures components in a simple volume */
> +# define PNFS_BLOCK_MAX_SIG_COMP 16
> +
> +struct pnfs_blk_sig {
> +	int 				si_num_comps;
> +	struct pnfs_blk_sig_comp	si_comps[PNFS_BLOCK_MAX_SIG_COMP];
> +};
> +
>  enum exstate4 {
>  	PNFS_BLOCK_READWRITE_DATA	= 0,
>  	PNFS_BLOCK_READ_DATA		= 1,
> @@ -96,6 +144,8 @@ struct pnfs_block_layout {
>  	sector_t		bl_blocksize;  /* Server blocksize in sectors */
>  };
>  
> +#define BLK_ID(lo) ((struct block_mount_id *)(NFS_SERVER(lo->plh_inode)->pnfs_ld_data))
> +
>  static inline struct pnfs_block_layout *
>  BLK_LO2EXT(struct pnfs_layout_hdr *lo)
>  {
> @@ -108,6 +158,51 @@ BLK_LSEG2EXT(struct pnfs_layout_segment *lseg)
>          return BLK_LO2EXT(lseg->pls_layout);
>  }
>  
> +uint32_t *blk_overflow(uint32_t *p, uint32_t *end, size_t nbytes);
> +
> +#define BLK_READBUF(p, e, nbytes)  do { \
> +	p = blk_overflow(p, e, nbytes); \
> +	if (!p) { \
> +		printk(KERN_WARNING \
> +			"%s: reply buffer overflowed in line %d.\n", \
> +			__func__, __LINE__); \
> +		goto out_err; \
> +	} \
> +} while (0)
> +
> +#define READ32(x)         (x) = ntohl(*p++)
> +#define READ64(x)         do {                  \
> +	(x) = (uint64_t)ntohl(*p++) << 32;           \
> +	(x) |= ntohl(*p++);                     \
> +} while (0)
> +#define COPYMEM(x, nbytes) do {                 \
> +	memcpy((x), p, nbytes);                 \
> +	p += XDR_QUADLEN(nbytes);               \
> +} while (0)
> +#define READ_DEVID(x)	COPYMEM((x)->data, NFS4_DEVICEID4_SIZE)
> +#define READ_SECTOR(x)     do { \
> +	READ64(tmp); \
> +	if (tmp & 0x1ff) { \
> +		printk(KERN_WARNING \
> +		       "%s Value not 512-byte aligned at line %d\n", \
> +		       __func__, __LINE__);			     \
> +		goto out_err; \
> +	} \
> +	(x) = tmp >> 9; \
> +} while (0)
> +
> +#define WRITE32(n)               do { \
> +	*p++ = htonl(n); \
> +	} while (0)
> +#define WRITE64(n)               do {                           \
> +	*p++ = htonl((uint32_t)((n) >> 32));			\
> +	*p++ = htonl((uint32_t)(n));				\
> +} while (0)
> +#define WRITEMEM(ptr, nbytes)     do {                          \
> +	p = xdr_encode_opaque_fixed(p, ptr, nbytes);	\
> +} while (0)
> +#define WRITE_DEVID(x)  WRITEMEM((x)->data, NFS4_DEVICEID4_SIZE)
> +

Please don't use these obsolete macros; use the official xdr {en,de}coding
helpers and be32_to_cpu directly instead.
We're trying to eradicate these macros from the nfs client.
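
For illustration only, here is a minimal user-space sketch of the cursor-returning style the XDR helpers use (the kernel's xdr_decode_hyper() has this shape), as opposed to macros that silently advance a local `p` and jump to a hidden label. The sketch_* names are hypothetical stand-ins, not kernel API, and ntohl() stands in for be32_to_cpu; it is not the code that was eventually merged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>	/* ntohl, htonl */

/* Hypothetical helper: decode one big-endian 32-bit word, return new cursor. */
static inline const uint32_t *sketch_decode_u32(const uint32_t *p, uint32_t *val)
{
	*val = ntohl(*p);
	return p + 1;
}

/* Hypothetical helper: decode a big-endian 64-bit value from two words. */
static inline const uint32_t *sketch_decode_hyper(const uint32_t *p, uint64_t *val)
{
	*val = (uint64_t)ntohl(p[0]) << 32;
	*val |= ntohl(p[1]);
	return p + 2;
}

/*
 * Decode a byte count and convert it to 512-byte sectors.
 * Returns NULL when the value is not sector aligned, so the caller
 * handles the error explicitly instead of via a goto buried in a macro.
 */
static inline const uint32_t *sketch_decode_sector(const uint32_t *p,
						    uint64_t *sectors)
{
	uint64_t bytes;

	assert(p != NULL && sectors != NULL);
	p = sketch_decode_hyper(p, &bytes);
	if (bytes & 0x1ff)
		return NULL;
	*sectors = bytes >> 9;
	return p;
}
```

A caller would write `p = sketch_decode_sector(p, &sect); if (!p) goto out_err;`, which keeps both the cursor update and the error path visible at the call site.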

Benny

>  /* blocklayoutdev.c */
>  struct block_device *nfs4_blkdev_get(dev_t dev);
>  int nfs4_blkdev_put(struct block_device *bdev);

* Re: [PATCH 21/34] pnfsblock: SPLITME: add extent manipulation functions
  2011-06-12 23:44 ` [PATCH 21/34] pnfsblock: SPLITME: add extent manipulation functions Jim Rees
@ 2011-06-14 15:40   ` Benny Halevy
  0 siblings, 0 replies; 58+ messages in thread
From: Benny Halevy @ 2011-06-14 15:40 UTC (permalink / raw)
  To: Jim Rees; +Cc: linux-nfs, peter honeyman

Regarding the "SPLITME", please either fix the commit message
or split the patch :)
(I'm in favour of keeping this patch as it is)

Benny

On 2011-06-12 19:44, Jim Rees wrote:
> From: Fred Isaman <iisaman@citi.umich.edu>
> 
> Adds working implementations of various support functions
> to handle INVAL extents, needed by writes, such as
> mark_initialized_sectors and is_sector_initialized.
> 
> SPLIT: this needs to be split into the exported functions, and the
> range support functions (which will be replaced eventually.)
> 
> [pnfsblock: fix 64-bit compiler warnings for extent manipulation]
> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
> ---
>  fs/nfs/blocklayout/blocklayout.h |   30 ++++-
>  fs/nfs/blocklayout/extents.c     |  253 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 281 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
> index 06aa36a..a231d49 100644
> --- a/fs/nfs/blocklayout/blocklayout.h
> +++ b/fs/nfs/blocklayout/blocklayout.h
> @@ -35,6 +35,8 @@
>  #include <linux/nfs_fs.h>
>  #include "../pnfs.h"
>  
> +#define PAGE_CACHE_SECTORS (PAGE_CACHE_SIZE >> 9)
> +
>  #define PG_pnfserr PG_owner_priv_1
>  #define PagePnfsErr(page)	test_bit(PG_pnfserr, &(page)->flags)
>  #define SetPagePnfsErr(page)	set_bit(PG_pnfserr, &(page)->flags)
> @@ -101,8 +103,23 @@ enum exstate4 {
>  	PNFS_BLOCK_NONE_DATA		= 3  /* unmapped, it's a hole */
>  };
>  
> +#define MY_MAX_TAGS (15) /* tag bitnums used must be less than this */
> +
> +struct my_tree_t {
> +	sector_t		mtt_step_size;	/* Internal sector alignment */
> +	struct list_head	mtt_stub; /* Should be a radix tree */
> +};
> +
>  struct pnfs_inval_markings {
> -	/* STUB */
> +	spinlock_t	im_lock;
> +	struct my_tree_t im_tree;	/* Sectors that need LAYOUTCOMMIT */
> +	sector_t	im_block_size;	/* Server blocksize in sectors */
> +};
> +
> +struct pnfs_inval_tracking {
> +	struct list_head it_link;
> +	int		 it_sector;
> +	int		 it_tags;
>  };
>  
>  /* sector_t fields are all in 512-byte sectors */
> @@ -121,7 +138,11 @@ struct pnfs_block_extent {
>  static inline void
>  INIT_INVAL_MARKS(struct pnfs_inval_markings *marks, sector_t blocksize)
>  {
> -	/* STUB */
> +	spin_lock_init(&marks->im_lock);
> +	INIT_LIST_HEAD(&marks->im_tree.mtt_stub);
> +	marks->im_block_size = blocksize;
> +	marks->im_tree.mtt_step_size = min((sector_t)PAGE_CACHE_SECTORS,
> +					   blocksize);
>  }
>  
>  enum extentclass4 {
> @@ -222,8 +243,13 @@ void free_block_dev(struct pnfs_block_dev *bdev);
>  struct pnfs_block_extent *
>  find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
>  		struct pnfs_block_extent **cow_read);
> +int mark_initialized_sectors(struct pnfs_inval_markings *marks,
> +			     sector_t offset, sector_t length,
> +			     sector_t **pages);
>  void put_extent(struct pnfs_block_extent *be);
>  struct pnfs_block_extent *alloc_extent(void);
> +struct pnfs_block_extent *get_extent(struct pnfs_block_extent *be);
> +int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect);
>  int add_and_merge_extent(struct pnfs_block_layout *bl,
>  			 struct pnfs_block_extent *new);
>  
> diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
> index f0b3f13..3d36f66 100644
> --- a/fs/nfs/blocklayout/extents.c
> +++ b/fs/nfs/blocklayout/extents.c
> @@ -33,6 +33,259 @@
>  #include "blocklayout.h"
>  #define NFSDBG_FACILITY         NFSDBG_PNFS_LD
>  
> +/* Bit numbers */
> +#define EXTENT_INITIALIZED 0
> +#define EXTENT_WRITTEN     1
> +#define EXTENT_IN_COMMIT   2
> +#define INTERNAL_EXISTS    MY_MAX_TAGS
> +#define INTERNAL_MASK      ((1 << INTERNAL_EXISTS) - 1)
> +
> +/* Returns largest t<=s s.t. t%base==0 */
> +static inline sector_t normalize(sector_t s, int base)
> +{
> +	sector_t tmp = s; /* Since do_div modifies its argument */
> +	return s - do_div(tmp, base);
> +}
> +
> +static inline sector_t normalize_up(sector_t s, int base)
> +{
> +	return normalize(s + base - 1, base);
> +}
> +
> +/* Complete stub using list while determine API wanted */
> +
> +/* Returns tags, or negative */
> +static int32_t _find_entry(struct my_tree_t *tree, u64 s)
> +{
> +	struct pnfs_inval_tracking *pos;
> +
> +	dprintk("%s(%llu) enter\n", __func__, s);
> +	list_for_each_entry_reverse(pos, &tree->mtt_stub, it_link) {
> +		if (pos->it_sector > s)
> +			continue;
> +		else if (pos->it_sector == s)
> +			return pos->it_tags & INTERNAL_MASK;
> +		else
> +			break;
> +	}
> +	return -ENOENT;
> +}
> +
> +static inline
> +int _has_tag(struct my_tree_t *tree, u64 s, int32_t tag)
> +{
> +	int32_t tags;
> +
> +	dprintk("%s(%llu, %i) enter\n", __func__, s, tag);
> +	s = normalize(s, tree->mtt_step_size);
> +	tags = _find_entry(tree, s);
> +	if ((tags < 0) || !(tags & (1 << tag)))
> +		return 0;
> +	else
> +		return 1;
> +}
> +
> +/* Creates entry with tag, or if entry already exists, unions tag to it.
> + * If storage is not NULL, newly created entry will use it.
> + * Returns number of entries added, or negative on error.
> + */
> +static int _add_entry(struct my_tree_t *tree, u64 s, int32_t tag,
> +		      struct pnfs_inval_tracking *storage)
> +{
> +	int found = 0;
> +	struct pnfs_inval_tracking *pos;
> +
> +	dprintk("%s(%llu, %i, %p) enter\n", __func__, s, tag, storage);
> +	list_for_each_entry_reverse(pos, &tree->mtt_stub, it_link) {
> +		if (pos->it_sector > s)
> +			continue;
> +		else if (pos->it_sector == s) {
> +			found = 1;
> +			break;
> +		} else
> +			break;
> +	}
> +	if (found) {
> +		pos->it_tags |= (1 << tag);
> +		return 0;
> +	} else {
> +		struct pnfs_inval_tracking *new;
> +		if (storage)
> +			new = storage;
> +		else {
> +			new = kmalloc(sizeof(*new), GFP_KERNEL);
> +			if (!new)
> +				return -ENOMEM;
> +		}
> +		new->it_sector = s;
> +		new->it_tags = (1 << tag);
> +		list_add(&new->it_link, &pos->it_link);
> +		return 1;
> +	}
> +}
> +
> +/* XXXX Really want option to not create */
> +/* Over range, unions tag with existing entries, else creates entry with tag */
> +static int _set_range(struct my_tree_t *tree, int32_t tag, u64 s, u64 length)
> +{
> +	u64 i;
> +
> +	dprintk("%s(%i, %llu, %llu) enter\n", __func__, tag, s, length);
> +	for (i = normalize(s, tree->mtt_step_size); i < s + length;
> +	     i += tree->mtt_step_size)
> +		if (_add_entry(tree, i, tag, NULL))
> +			return -ENOMEM;
> +	return 0;
> +}
> +
> +/* Ensure that future operations on given range of tree will not malloc */
> +static int _preload_range(struct my_tree_t *tree, u64 offset, u64 length)
> +{
> +	u64 start, end, s;
> +	int count, i, used = 0, status = -ENOMEM;
> +	struct pnfs_inval_tracking **storage;
> +
> +	dprintk("%s(%llu, %llu) enter\n", __func__, offset, length);
> +	start = normalize(offset, tree->mtt_step_size);
> +	end = normalize_up(offset + length, tree->mtt_step_size);
> +	count = (int)(end - start) / (int)tree->mtt_step_size;
> +
> +	/* Pre-malloc what memory we might need */
> +	storage = kmalloc(sizeof(*storage) * count, GFP_KERNEL);
> +	if (!storage)
> +		return -ENOMEM;
> +	for (i = 0; i < count; i++) {
> +		storage[i] = kmalloc(sizeof(struct pnfs_inval_tracking),
> +				     GFP_KERNEL);
> +		if (!storage[i])
> +			goto out_cleanup;
> +	}
> +
> +	/* Now need lock - HOW??? */
> +
> +	for (s = start; s < end; s += tree->mtt_step_size)
> +		used += _add_entry(tree, s, INTERNAL_EXISTS, storage[used]);
> +
> +	/* Unlock - HOW??? */
> +	status = 0;
> +
> + out_cleanup:
> +	for (i = used; i < count; i++) {
> +		if (!storage[i])
> +			break;
> +		kfree(storage[i]);
> +	}
> +	kfree(storage);
> +	return status;
> +}
> +
> +static void set_needs_init(sector_t *array, sector_t offset)
> +{
> +	sector_t *p = array;
> +
> +	dprintk("%s enter\n", __func__);
> +	if (!p)
> +		return;
> +	while (*p < offset)
> +		p++;
> +	if (*p == offset)
> +		return;
> +	else if (*p == ~0) {
> +		*p++ = offset;
> +		*p = ~0;
> +		return;
> +	} else {
> +		sector_t *save = p;
> +		dprintk("%s Adding %llu\n", __func__, (u64)offset);
> +		while (*p != ~0)
> +			p++;
> +		p++;
> +		memmove(save + 1, save, (char *)p - (char *)save);
> +		*save = offset;
> +		return;
> +	}
> +}
> +
> +/* We are relying on page lock to serialize this */
> +int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect)
> +{
> +	int rv;
> +
> +	spin_lock(&marks->im_lock);
> +	rv = _has_tag(&marks->im_tree, isect, EXTENT_INITIALIZED);
> +	spin_unlock(&marks->im_lock);
> +	return rv;
> +}
> +
> +/* Marks sectors in [offset, offset+length) as having been initialized.
> + * All lengths are step-aligned, where step is min(pagesize, blocksize).
> + * Notes where partial block is initialized, and helps prepare it for
> + * complete initialization later.
> + */
> +/* Currently assumes offset is page-aligned */
> +int mark_initialized_sectors(struct pnfs_inval_markings *marks,
> +			     sector_t offset, sector_t length,
> +			     sector_t **pages)
> +{
> +	sector_t s, start, end;
> +	sector_t *array = NULL; /* Pages to mark */
> +
> +	dprintk("%s(offset=%llu,len=%llu) enter\n",
> +		__func__, (u64)offset, (u64)length);
> +	s = max((sector_t) 3,
> +		2 * (marks->im_block_size / (PAGE_CACHE_SECTORS)));
> +	dprintk("%s set max=%llu\n", __func__, (u64)s);
> +	if (pages) {
> +		array = kmalloc(s * sizeof(sector_t), GFP_KERNEL);
> +		if (!array)
> +			goto outerr;
> +		array[0] = ~0;
> +	}
> +
> +	start = normalize(offset, marks->im_block_size);
> +	end = normalize_up(offset + length, marks->im_block_size);
> +	if (_preload_range(&marks->im_tree, start, end - start))
> +		goto outerr;
> +
> +	spin_lock(&marks->im_lock);
> +
> +	for (s = normalize_up(start, PAGE_CACHE_SECTORS);
> +	     s < offset; s += PAGE_CACHE_SECTORS) {
> +		dprintk("%s pre-area pages\n", __func__);
> +		/* Portion of used block is not initialized */
> +		if (!_has_tag(&marks->im_tree, s, EXTENT_INITIALIZED))
> +			set_needs_init(array, s);
> +	}
> +	if (_set_range(&marks->im_tree, EXTENT_INITIALIZED, offset, length))
> +		goto out_unlock;
> +	for (s = normalize_up(offset + length, PAGE_CACHE_SECTORS);
> +	     s < end; s += PAGE_CACHE_SECTORS) {
> +		dprintk("%s post-area pages\n", __func__);
> +		if (!_has_tag(&marks->im_tree, s, EXTENT_INITIALIZED))
> +			set_needs_init(array, s);
> +	}
> +
> +	spin_unlock(&marks->im_lock);
> +
> +	if (pages) {
> +		if (array[0] == ~0) {
> +			kfree(array);
> +			*pages = NULL;
> +		} else
> +			*pages = array;
> +	}
> +	return 0;
> +
> + out_unlock:
> +	spin_unlock(&marks->im_lock);
> + outerr:
> +	if (pages) {
> +		kfree(array);
> +		*pages = NULL;
> +	}
> +	return -ENOMEM;
> +}
> +
>  static void print_bl_extent(struct pnfs_block_extent *be)
>  {
>  	dprintk("PRINT EXTENT extent %p\n", be);
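
As a reading aid for the range arithmetic in the quoted patch, here is a small user-space sketch of the round-down/round-up helpers and of how many step-sized tracking entries a range touches. It uses plain `%` where the patch uses do_div(), and the sketch_* names are hypothetical; it only illustrates the arithmetic, not the locking or list handling.

```c
#include <assert.h>
#include <stdint.h>

/* Largest multiple of base that is <= s (round down). */
static uint64_t sketch_normalize(uint64_t s, uint32_t base)
{
	assert(base != 0);
	return s - (s % base);
}

/* Smallest multiple of base that is >= s (round up). */
static uint64_t sketch_normalize_up(uint64_t s, uint32_t base)
{
	return sketch_normalize(s + base - 1, base);
}

/*
 * Number of step-sized tracking entries needed to cover
 * [offset, offset + length), i.e. the count that a preload
 * pass would have to allocate up front.
 */
static uint64_t sketch_entries_in_range(uint64_t offset, uint64_t length,
					uint32_t step)
{
	uint64_t start = sketch_normalize(offset, step);
	uint64_t end = sketch_normalize_up(offset + length, step);

	return (end - start) / step;
}
```

For example, with a step of 8 sectors, the range starting at sector 5 with length 6 spans sectors 5 through 10 and therefore touches two entries (those starting at 0 and at 8).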

* Re: [PATCH 23/34] pnfsblock: encode_layoutcommit
  2011-06-12 23:44 ` [PATCH 23/34] pnfsblock: encode_layoutcommit Jim Rees
@ 2011-06-14 15:44   ` Benny Halevy
  0 siblings, 0 replies; 58+ messages in thread
From: Benny Halevy @ 2011-06-14 15:44 UTC (permalink / raw)
  To: Jim Rees; +Cc: linux-nfs, peter honeyman

On 2011-06-12 19:44, Jim Rees wrote:
> From: Fred Isaman <iisaman@citi.umich.edu>
> 
> In blocklayout driver. There are two things happening
> while layoutcommit/cleanup.
> 1. the modified extents are encoded.
> 2. On cleanup the extents are put back on the layout rw
>    extents list, for reads.
> 
> In the new system where actual xdr encoding is done in
> encode_layoutcommit() directly into xdr buffer, these are
> the new commit stages:
> 
> 1. On setup_layoutcommit, the range is adjusted as before
>    and a structure is allocated for communication with
>    bl_encode_layoutcommit && bl_cleanup_layoutcommit
>    (Generic layer provides a void-star to hang it on)
> 
> 2. bl_encode_layoutcommit is called to do the actual
>    encoding directly into xdr. The commit-extent-list is not
>    freed and is stored on above structure.
>    FIXME: The code is not yet converted to the new XDR cleanup

please fix for submission...

> 
> 3. On cleanup the commit-extent-list is put back by a call
>    to set_to_rw() as before, but with no need for XDR decoding
>    of the list as before. And the commit-extent-list is freed.
>    Finally allocated structure is freed.
> 
> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
> [blocklayout: encode_layoutcommit implementation]
> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
> [pnfsblock: fix bug setting up layoutcommit.]
> Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
> [pnfsblock: prevent commit list corruption]
> [pnfsblock: fix layoutcommit with an empty opaque]
> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
> ---
>  fs/nfs/blocklayout/blocklayout.c |    2 +
>  fs/nfs/blocklayout/blocklayout.h |   12 +++
>  fs/nfs/blocklayout/extents.c     |  175 ++++++++++++++++++++++++++++----------
>  3 files changed, 145 insertions(+), 44 deletions(-)
> 
> diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
> index 36374f4..1c9a5d0 100644
> --- a/fs/nfs/blocklayout/blocklayout.c
> +++ b/fs/nfs/blocklayout/blocklayout.c
> @@ -156,6 +156,8 @@ static void
>  bl_encode_layoutcommit(struct pnfs_layout_hdr *lo, struct xdr_stream *xdr,
>  		       const struct nfs4_layoutcommit_args *arg)
>  {
> +	dprintk("%s enter\n", __func__);

do we really need that debug printout?

> +	encode_pnfs_block_layoutupdate(BLK_LO2EXT(lo), xdr, arg);
>  }
>  
>  static void
> diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
> index a231d49..03d703b 100644
> --- a/fs/nfs/blocklayout/blocklayout.h
> +++ b/fs/nfs/blocklayout/blocklayout.h
> @@ -135,6 +135,15 @@ struct pnfs_block_extent {
>  	struct pnfs_inval_markings *be_inval; /* tracks INVAL->RW transition */
>  };
>  
> +/* Shortened extent used by LAYOUTCOMMIT */
> +struct pnfs_block_short_extent {
> +	struct list_head bse_node;
> +	struct nfs4_deviceid bse_devid;	/* STUB - removable??? */
> +	struct block_device *bse_mdev;
> +	sector_t	bse_f_offset;	/* the starting offset in the file */
> +	sector_t	bse_length;	/* the size of the extent */
> +};
> +
>  static inline void
>  INIT_INVAL_MARKS(struct pnfs_inval_markings *marks, sector_t blocksize)
>  {
> @@ -250,6 +259,9 @@ void put_extent(struct pnfs_block_extent *be);
>  struct pnfs_block_extent *alloc_extent(void);
>  struct pnfs_block_extent *get_extent(struct pnfs_block_extent *be);
>  int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect);
> +int encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
> +				   struct xdr_stream *xdr,
> +				   const struct nfs4_layoutcommit_args *arg);
>  int add_and_merge_extent(struct pnfs_block_layout *bl,
>  			 struct pnfs_block_extent *new);
>  
> diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
> index 43a3601..e754d32 100644
> --- a/fs/nfs/blocklayout/extents.c
> +++ b/fs/nfs/blocklayout/extents.c
> @@ -286,6 +286,47 @@ int mark_initialized_sectors(struct pnfs_inval_markings *marks,
>  	return -ENOMEM;
>  }
>  
> +/* Marks sectors in [offset, offset+length) as having been written to disk.
> + * All lengths should be block aligned.
> + */
> +int mark_written_sectors(struct pnfs_inval_markings *marks,
> +			 sector_t offset, sector_t length)
> +{
> +	int status;
> +
> +	dprintk("%s(offset=%llu,len=%llu) enter\n", __func__,
> +		(u64)offset, (u64)length);
> +	spin_lock(&marks->im_lock);
> +	status = _set_range(&marks->im_tree, EXTENT_WRITTEN, offset, length);
> +	spin_unlock(&marks->im_lock);
> +	return status;
> +}
> +
> +static void print_short_extent(struct pnfs_block_short_extent *be)
> +{
> +	dprintk("PRINT SHORT EXTENT extent %p\n", be);
> +	if (be) {
> +		dprintk("        be_f_offset %llu\n", (u64)be->bse_f_offset);
> +		dprintk("        be_length   %llu\n", (u64)be->bse_length);
> +	}
> +}
> +
> +void print_clist(struct list_head *list, unsigned int count)
> +{
> +	struct pnfs_block_short_extent *be;
> +	unsigned int i = 0;
> +
> +	dprintk("****************\n");
> +	dprintk("Extent list looks like:\n");
> +	list_for_each_entry(be, list, bse_node) {
> +		i++;
> +		print_short_extent(be);
> +	}
> +	if (i != count)
> +		dprintk("\n\nExpected %u entries\n\n\n", count);
> +	dprintk("****************\n");
> +}
> +
>  static void print_bl_extent(struct pnfs_block_extent *be)
>  {
>  	dprintk("PRINT EXTENT extent %p\n", be);
> @@ -386,65 +427,67 @@ add_and_merge_extent(struct pnfs_block_layout *bl,
>  	/* Scan for proper place to insert, extending new to the left
>  	 * as much as possible.
>  	 */
> -	list_for_each_entry_safe(be, tmp, list, be_node) {
> -		if (new->be_f_offset < be->be_f_offset)
> +	list_for_each_entry_safe_reverse(be, tmp, list, be_node) {
> +		if (new->be_f_offset >= be->be_f_offset + be->be_length)
>  			break;
> -		if (end <= be->be_f_offset + be->be_length) {
> -			/* new is a subset of existing be*/
> +		if (new->be_f_offset >= be->be_f_offset) {
> +			if (end <= be->be_f_offset + be->be_length) {
> +				/* new is a subset of existing be*/
> +				if (extents_consistent(be, new)) {
> +					dprintk("%s: new is subset, ignoring\n",
> +						__func__);
> +					put_extent(new);
> +					return 0;
> +				} else {
> +					goto out_err;
> +				}
> +			} else {
> +				/* |<--   be   -->|
> +				 *          |<--   new   -->| */
> +				if (extents_consistent(be, new)) {
> +					/* extend new to fully replace be */
> +					new->be_length += new->be_f_offset -
> +						be->be_f_offset;
> +					new->be_f_offset = be->be_f_offset;
> +					new->be_v_offset = be->be_v_offset;
> +					dprintk("%s: removing %p\n", __func__, be);
> +					list_del(&be->be_node);
> +					put_extent(be);
> +				} else {
> +					goto out_err;
> +				}
> +			}
> +		} else if (end >= be->be_f_offset + be->be_length) {
> +			/* new extent overlap existing be */
>  			if (extents_consistent(be, new)) {
> -				dprintk("%s: new is subset, ignoring\n",
> -					__func__);
> -				put_extent(new);
> -				return 0;
> -			} else
> +				/* extend new to fully replace be */
> +				dprintk("%s: removing %p\n", __func__, be);
> +				list_del(&be->be_node);
> +				put_extent(be);
> +			} else {
>  				goto out_err;
> -		} else if (new->be_f_offset <=
> -				be->be_f_offset + be->be_length) {
> -			/* new overlaps or abuts existing be */
> -			if (extents_consistent(be, new)) {
> +			}
> +		} else if (end > be->be_f_offset) {
> +			/*           |<--   be   -->|
> +			 *|<--   new   -->| */
> +			if (extents_consistent(new, be)) {
>  				/* extend new to fully replace be */
> -				new->be_length += new->be_f_offset -
> -						  be->be_f_offset;
> -				new->be_f_offset = be->be_f_offset;
> -				new->be_v_offset = be->be_v_offset;
> +				new->be_length += be->be_f_offset + be->be_length -
> +					new->be_f_offset - new->be_length;
>  				dprintk("%s: removing %p\n", __func__, be);
>  				list_del(&be->be_node);
>  				put_extent(be);
> -			} else if (new->be_f_offset !=
> -				   be->be_f_offset + be->be_length)
> +			} else {
>  				goto out_err;
> +			}

why is all this introduced in this patch?

Benny

>  		}
>  	}
>  	/* Note that if we never hit the above break, be will not point to a
>  	 * valid extent.  However, in that case &be->be_node==list.
>  	 */
> -	list_add_tail(&new->be_node, &be->be_node);
> +	list_add(&new->be_node, &be->be_node);
>  	dprintk("%s: inserting new\n", __func__);
>  	print_elist(list);
> -	/* Scan forward for overlaps.  If we find any, extend new and
> -	 * remove the overlapped extent.
> -	 */
> -	be = list_prepare_entry(new, list, be_node);
> -	list_for_each_entry_safe_continue(be, tmp, list, be_node) {
> -		if (end < be->be_f_offset)
> -			break;
> -		/* new overlaps or abuts existing be */
> -		if (extents_consistent(be, new)) {
> -			if (end < be->be_f_offset + be->be_length) {
> -				/* extend new to fully cover be */
> -				end = be->be_f_offset + be->be_length;
> -				new->be_length = end - new->be_f_offset;
> -			}
> -			dprintk("%s: removing %p\n", __func__, be);
> -			list_del(&be->be_node);
> -			put_extent(be);
> -		} else if (end != be->be_f_offset) {
> -			list_del(&new->be_node);
> -			goto out_err;
> -		}
> -	}
> -	dprintk("%s: after merging\n", __func__);
> -	print_elist(list);
>  	/* STUB - The per-list consistency checks have all been done,
>  	 * should now check cross-list consistency.
>  	 */
> @@ -502,6 +545,50 @@ find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
>  	return ret;
>  }
>  
> +int
> +encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
> +			       struct xdr_stream *xdr,
> +			       const struct nfs4_layoutcommit_args *arg)
> +{
> +	struct pnfs_block_short_extent *lce, *save;
> +	unsigned int count = 0;
> +	struct list_head *ranges = &bl->bl_committing;
> +	__be32 *p, *xdr_start;
> +
> +	dprintk("%s enter\n", __func__);
> +	/* BUG - creation of bl_commit is buggy - need to wait for
> +	 * entire block to be marked WRITTEN before it can be added.
> +	 */
> +	spin_lock(&bl->bl_ext_lock);
> +	/* Want to adjust for possible truncate */
> +	/* We now want to adjust argument range */
> +
> +	/* XDR encode the ranges found */
> +	xdr_start = xdr_reserve_space(xdr, 8);
> +	if (!xdr_start)
> +		goto out;
> +	list_for_each_entry_safe(lce, save, &bl->bl_commit, bse_node) {
> +		p = xdr_reserve_space(xdr, 7 * 4 + sizeof(lce->bse_devid.data));
> +		if (!p)
> +			break;
> +		WRITE_DEVID(&lce->bse_devid);
> +		WRITE64(lce->bse_f_offset << 9);
> +		WRITE64(lce->bse_length << 9);
> +		WRITE64(0LL);
> +		WRITE32(PNFS_BLOCK_READWRITE_DATA);
> +		list_del(&lce->bse_node);
> +		list_add_tail(&lce->bse_node, ranges);
> +		bl->bl_count--;
> +		count++;
> +	}
> +	xdr_start[0] = cpu_to_be32((xdr->p - xdr_start - 1) * 4);
> +	xdr_start[1] = cpu_to_be32(count);
> +out:
> +	spin_unlock(&bl->bl_ext_lock);
> +	dprintk("%s found %i ranges\n", __func__, count);
> +	return 0;
> +}
> +
>  /* Helper function to set_to_rw that initialize a new extent */
>  static void
>  _prep_new_extent(struct pnfs_block_extent *new,
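
To make the sizes in the quoted encode loop easier to check, here is a user-space sketch of the same wire layout: each extent is a 16-byte device id, three 64-bit fields and one 32-bit state, which is 16 + 24 + 4 = 44 bytes and matches the `7 * 4 + sizeof(devid)` reserved per entry. The sketch_* names and the flat buffer are assumptions for illustration; the real code writes into an xdr_stream, and the device id size is assumed to equal NFS4_DEVICEID4_SIZE (16).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>	/* htonl, ntohl */

#define SKETCH_DEVID_SIZE   16	/* assumed equal to NFS4_DEVICEID4_SIZE */
#define SKETCH_EXTENT_WORDS (SKETCH_DEVID_SIZE / 4 + 7)	/* 11 words = 44 bytes */

/* Hypothetical short extent: file offset and length in 512-byte sectors. */
struct sketch_short_extent {
	uint64_t f_offset;
	uint64_t length;
};

static uint32_t *sketch_encode_hyper(uint32_t *p, uint64_t val)
{
	*p++ = htonl((uint32_t)(val >> 32));
	*p++ = htonl((uint32_t)val);
	return p;
}

static uint32_t *sketch_encode_extent(uint32_t *p, const unsigned char *devid,
				      const struct sketch_short_extent *e)
{
	memcpy(p, devid, SKETCH_DEVID_SIZE);
	p += SKETCH_DEVID_SIZE / 4;
	p = sketch_encode_hyper(p, e->f_offset << 9);	/* sectors to bytes */
	p = sketch_encode_hyper(p, e->length << 9);
	p = sketch_encode_hyper(p, 0);			/* storage offset, 0 as in the patch */
	*p++ = htonl(0);				/* PNFS_BLOCK_READWRITE_DATA */
	return p;
}

/*
 * Encode `count` extents after a two-word header, then backfill the header:
 * word 0 is the opaque length in bytes (everything after word 0),
 * word 1 is the extent count. Returns the number of 32-bit words written.
 */
static size_t sketch_encode_layoutupdate(uint32_t *buf,
					 const unsigned char *devid,
					 const struct sketch_short_extent *ext,
					 uint32_t count)
{
	uint32_t *start = buf;
	uint32_t *p = buf + 2;
	uint32_t i;

	assert(buf != NULL && devid != NULL);
	for (i = 0; i < count; i++)
		p = sketch_encode_extent(p, devid, &ext[i]);
	start[0] = htonl((uint32_t)((p - start - 1) * 4));
	start[1] = htonl(count);
	return (size_t)(p - start);
}
```

With two extents the opaque length comes out as 4 + 2 * 44 = 92 bytes, which is the value the backfill in the quoted code computes from the cursor difference.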

* Re: [PATCH 04/34] pnfs: hook nfs_write_begin/end to allow layout driver manipulation
  2011-06-14 14:05       ` Fred Isaman
@ 2011-06-14 15:53         ` Peng Tao
  2011-06-14 16:02           ` Fred Isaman
  0 siblings, 1 reply; 58+ messages in thread
From: Peng Tao @ 2011-06-14 15:53 UTC (permalink / raw)
  To: Fred Isaman; +Cc: tao.peng, rees, linux-nfs, honey

On Tue, Jun 14, 2011 at 10:05 PM, Fred Isaman <iisaman@netapp.com> wrote:
> On Tue, Jun 14, 2011 at 7:01 AM,  <tao.peng@emc.com> wrote:
>> Hi, Fred,
>>
>>> -----Original Message-----
>>> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-owner@vger.kernel.org]
>>> On Behalf Of Fred Isaman
>>> Sent: Monday, June 13, 2011 10:44 PM
>>> To: Jim Rees
>>> Cc: linux-nfs@vger.kernel.org; peter honeyman
>>> Subject: Re: [PATCH 04/34] pnfs: hook nfs_write_begin/end to allow layout driver
>>> manipulation
>>>
>>> On Sun, Jun 12, 2011 at 7:43 PM, Jim Rees <rees@umich.edu> wrote:
>>> > From: Peng Tao <bergwolf@gmail.com>
>>> >
>>> > Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
>>> > Signed-off-by: Benny Halevy <bhalevy@panasas.com>
>>> > Reported-by: Alexandros Batsakis <batsakis@netapp.com>
>>> > Signed-off-by: Andy Adamson <andros@netapp.com>
>>> > Signed-off-by: Fred Isaman <iisaman@netapp.com>
>>> > Signed-off-by: Benny Halevy <bhalevy@panasas.com>
>>> > Signed-off-by: Peng Tao <bergwolf@gmail.com>
>>> > ---
>>> >  fs/nfs/file.c          |   26 ++++++++++-
>>> >  fs/nfs/pnfs.c          |   41 +++++++++++++++++
>>> >  fs/nfs/pnfs.h          |  115
>>> ++++++++++++++++++++++++++++++++++++++++++++++++
>>> >  fs/nfs/write.c         |   12 +++--
>>> >  include/linux/nfs_fs.h |    3 +-
>>> >  5 files changed, 189 insertions(+), 8 deletions(-)
>>> >
>>> > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
>>> > index 2f093ed..1768762 100644
>>> > --- a/fs/nfs/file.c
>>> > +++ b/fs/nfs/file.c
>>> > @@ -384,12 +384,15 @@ static int nfs_write_begin(struct file *file, struct
>>> address_space *mapping,
>>> >        pgoff_t index = pos >> PAGE_CACHE_SHIFT;
>>> >        struct page *page;
>>> >        int once_thru = 0;
>>> > +       struct pnfs_layout_segment *lseg;
>>> >
>>> >        dfprintk(PAGECACHE, "NFS: write_begin(%s/%s(%ld), %u@%lld)\n",
>>> >                file->f_path.dentry->d_parent->d_name.name,
>>> >                file->f_path.dentry->d_name.name,
>>> >                mapping->host->i_ino, len, (long long) pos);
>>> > -
>>> > +       lseg = pnfs_update_layout(mapping->host,
>>> > +                                 nfs_file_open_context(file),
>>> > +                                 pos, len, IOMODE_RW, GFP_NOFS);
>>>
>>>
>>> This looks like it is left over from before the rearrangements done to
>>> where pnfs_update_layout is called.
>>> In particular, we don't want to hold the reference on the lseg from
>>> here until flush time.  And there
>>> seems to be no reason to.  If the client needs a layout to deal with
>>> read-in here, it should instead
>>> trigger the nfs_want_read_modify_write clause.
>> Yes, you are right. Directly calling pnfs_update_layout here can be avoided.
>> But it seems that triggering nfs_want_read_modify_write will acquire a read-only layout segment via the readpage code path.
>> For a write, the client will need a read-write layout segment, so this would mean two LAYOUTGETs for each new segment (one in nfs_readpage and one at flush time). That may not be good for performance.
>> Does the current generic code have a way to avoid this?
>>
>> Thanks,
>> Tao
>>
>
> No.  However, note that this only hits in the case where you are doing
> subpage writes.
The block layout driver needs the segment to determine whether it should
dirty other pages in the same fsblock, based on whether this is the first
write to an INVALID extent. So it still hits whenever an fsblock is
dirtied for the first time.

Thanks,
Tao
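
[Editor's sketch] The "other pages in the same fsblock" mentioned above can
be pictured with a little arithmetic. This is only an illustration, not code
from the patch set: the helper name fsblock_pages, struct page_range, and
the assumption that the block size is a nonzero multiple of the page size
are all hypothetical.

```c
#include <stdint.h>

struct page_range {
	uint64_t first;  /* first page index in the fsblock */
	uint64_t last;   /* last page index in the fsblock, inclusive */
};

/*
 * Hypothetical helper: page indices spanned by the fsblock that contains
 * byte offset pos.  Assumes blksize is a nonzero multiple of page_size.
 */
static struct page_range fsblock_pages(uint64_t pos, uint64_t blksize,
				       uint64_t page_size)
{
	/* Round pos down to the start of its fsblock. */
	uint64_t block_start = pos - (pos % blksize);
	struct page_range r;

	r.first = block_start / page_size;
	r.last = (block_start + blksize) / page_size - 1;
	return r;
}
```

With a 16K fsblock and 4K pages, a write at byte offset 20000 falls in the
second fsblock, so the range covers page indices 4 through 7.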

>
> Fred
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 04/34] pnfs: hook nfs_write_begin/end to allow layout driver manipulation
  2011-06-14 15:53         ` Peng Tao
@ 2011-06-14 16:02           ` Fred Isaman
  0 siblings, 0 replies; 58+ messages in thread
From: Fred Isaman @ 2011-06-14 16:02 UTC (permalink / raw)
  To: Peng Tao; +Cc: tao.peng, rees, linux-nfs, honey

On Tue, Jun 14, 2011 at 11:53 AM, Peng Tao <bergwolf@gmail.com> wrote:
> On Tue, Jun 14, 2011 at 10:05 PM, Fred Isaman <iisaman@netapp.com> wrote:
>> On Tue, Jun 14, 2011 at 7:01 AM,  <tao.peng@emc.com> wrote:
>>> Hi, Fred,
>>>
>>>> -----Original Message-----
>>>> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-owner@vger.kernel.org]
>>>> On Behalf Of Fred Isaman
>>>> Sent: Monday, June 13, 2011 10:44 PM
>>>> To: Jim Rees
>>>> Cc: linux-nfs@vger.kernel.org; peter honeyman
>>>> Subject: Re: [PATCH 04/34] pnfs: hook nfs_write_begin/end to allow layout driver
>>>> manipulation
>>>>
>>>> On Sun, Jun 12, 2011 at 7:43 PM, Jim Rees <rees@umich.edu> wrote:
>>>> > From: Peng Tao <bergwolf@gmail.com>
>>>> >
>>>> > Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
>>>> > Signed-off-by: Benny Halevy <bhalevy@panasas.com>
>>>> > Reported-by: Alexandros Batsakis <batsakis@netapp.com>
>>>> > Signed-off-by: Andy Adamson <andros@netapp.com>
>>>> > Signed-off-by: Fred Isaman <iisaman@netapp.com>
>>>> > Signed-off-by: Benny Halevy <bhalevy@panasas.com>
>>>> > Signed-off-by: Peng Tao <bergwolf@gmail.com>
>>>> > ---
>>>> >  fs/nfs/file.c          |   26 ++++++++++-
>>>> >  fs/nfs/pnfs.c          |   41 +++++++++++++++++
>>>> >  fs/nfs/pnfs.h          |  115 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>> >  fs/nfs/write.c         |   12 +++--
>>>> >  include/linux/nfs_fs.h |    3 +-
>>>> >  5 files changed, 189 insertions(+), 8 deletions(-)
>>>> >
>>>> > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
>>>> > index 2f093ed..1768762 100644
>>>> > --- a/fs/nfs/file.c
>>>> > +++ b/fs/nfs/file.c
>>>> > @@ -384,12 +384,15 @@ static int nfs_write_begin(struct file *file, struct address_space *mapping,
>>>> >        pgoff_t index = pos >> PAGE_CACHE_SHIFT;
>>>> >        struct page *page;
>>>> >        int once_thru = 0;
>>>> > +       struct pnfs_layout_segment *lseg;
>>>> >
>>>> >        dfprintk(PAGECACHE, "NFS: write_begin(%s/%s(%ld), %u@%lld)\n",
>>>> >                file->f_path.dentry->d_parent->d_name.name,
>>>> >                file->f_path.dentry->d_name.name,
>>>> >                mapping->host->i_ino, len, (long long) pos);
>>>> > -
>>>> > +       lseg = pnfs_update_layout(mapping->host,
>>>> > +                                 nfs_file_open_context(file),
>>>> > +                                 pos, len, IOMODE_RW, GFP_NOFS);
>>>>
>>>>
>>>> This looks like it is left over from before the rearrangements done to
>>>> where pnfs_update_layout is called.
>>>> In particular, we don't want to hold the reference on the lseg from
>>>> here until flush time, and there seems to be no reason to.  If the
>>>> client needs a layout to deal with read-in here, it should instead
>>>> trigger the nfs_want_read_modify_write clause.
>>> Yes, you are right. Directly calling pnfs_update_layout here can be avoided.
>>> But it seems that triggering nfs_want_read_modify_write will acquire a read-only layout segment via the readpage code path.
>>> For a write, the client will need a read-write layout segment, so it would mean two layoutgets for each new segment (one in nfs_readpage and one at flush time). That may not be good for performance.
>>> Does the current generic code have a way to avoid this?
>>>
>>> Thanks,
>>> Tao
>>>
>>
>> No.  However, note that this only hits in the case where you are doing
>> subpage writes.
> The block layout driver needs the segment to determine whether it should
> dirty other pages in the same fsblock, based on whether this is the first
> write to an INVALID extent. So it still hits whenever an fsblock is
> dirtied for the first time.
>
> Thanks,
> Tao
>

Why can't this be delayed until flush?

Fred

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 34/34] NFS41: do not update isize if inode needs layoutcommit
  2011-06-12 23:45 ` [PATCH 34/34] NFS41: do not update isize if inode needs layoutcommit Jim Rees
@ 2011-06-14 16:15   ` Benny Halevy
  2011-06-14 16:22     ` Fred Isaman
  0 siblings, 1 reply; 58+ messages in thread
From: Benny Halevy @ 2011-06-14 16:15 UTC (permalink / raw)
  To: Jim Rees; +Cc: linux-nfs, peter honeyman, Trond Myklebust

reminder: this is a generic patch that should be pushed upstream
separately.

Benny

On 2011-06-12 19:45, Jim Rees wrote:
> From: Peng Tao <bergwolf@gmail.com>
> 
> Layout commit is supposed to set server file size similar to nfs pages.
> We should not update client file size for the same reason.
> Otherwise we will lose what we have at hand.
> 
> Signed-off-by: Peng Tao <peng_tao@emc.com>
> ---
>  fs/nfs/inode.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index 144f2a3..3f1eb81 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -1294,7 +1294,8 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
>  		if (new_isize != cur_isize) {
>  			/* Do we perhaps have any outstanding writes, or has
>  			 * the file grown beyond our last write? */
> -			if (nfsi->npages == 0 || new_isize > cur_isize) {
> +			if ((nfsi->npages == 0 && !test_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) ||
> +			     new_isize > cur_isize) {
>  				i_size_write(inode, new_isize);
>  				invalid |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_DATA;
>  			}
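
[Editor's sketch] The condition changed by the hunk above can be read as a
small predicate. The following is a user-space model for illustration only,
not the kernel code: struct isize_state and should_accept_server_isize are
hypothetical names, and the boolean layoutcommit_pending stands in for
test_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the few pieces of nfs_inode state the check
 * looks at; the real struct nfs_inode has many more fields.
 */
struct isize_state {
	unsigned long npages;       /* pages with outstanding writes */
	bool layoutcommit_pending;  /* models the NFS_INO_LAYOUTCOMMIT bit */
	uint64_t cur_isize;         /* file size the client currently holds */
};

/* Return true if the server-reported size should replace the cached one. */
static bool should_accept_server_isize(const struct isize_state *st,
				       uint64_t new_isize)
{
	if (new_isize == st->cur_isize)
		return false;
	/* No outstanding writes and no pending layoutcommit: trust the server. */
	if (st->npages == 0 && !st->layoutcommit_pending)
		return true;
	/* Otherwise only accept a size that grew past what the client has. */
	return new_isize > st->cur_isize;
}
```

In this model, a client with a pending layoutcommit and a cached size of
8192 ignores a server-reported size of 4096 but still accepts 16384, which
matches the intent stated in the commit message.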

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 34/34] NFS41: do not update isize if inode needs layoutcommit
  2011-06-14 16:15   ` Benny Halevy
@ 2011-06-14 16:22     ` Fred Isaman
  0 siblings, 0 replies; 58+ messages in thread
From: Fred Isaman @ 2011-06-14 16:22 UTC (permalink / raw)
  To: Benny Halevy; +Cc: Jim Rees, linux-nfs, peter honeyman, Trond Myklebust

On Tue, Jun 14, 2011 at 12:15 PM, Benny Halevy <bhalevy.lists@gmail.com> wrote:
> reminder: this is a generic patch that should be pushed upstream
> separately.
>
> Benny
>

And which in fact is currently in Trond's bugfixes branch.

Fred

> On 2011-06-12 19:45, Jim Rees wrote:
>> From: Peng Tao <bergwolf@gmail.com>
>>
>> Layout commit is supposed to set server file size similar to nfs pages.
>> We should not update client file size for the same reason.
>> Otherwise we will lose what we have at hand.
>>
>> Signed-off-by: Peng Tao <peng_tao@emc.com>
>> ---
>>  fs/nfs/inode.c |    3 ++-
>>  1 files changed, 2 insertions(+), 1 deletions(-)
>>
>> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
>> index 144f2a3..3f1eb81 100644
>> --- a/fs/nfs/inode.c
>> +++ b/fs/nfs/inode.c
>> @@ -1294,7 +1294,8 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
>>               if (new_isize != cur_isize) {
>>                       /* Do we perhaps have any outstanding writes, or has
>>                        * the file grown beyond our last write? */
>> -                     if (nfsi->npages == 0 || new_isize > cur_isize) {
>> +                     if ((nfsi->npages == 0 && !test_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) ||
>> +                          new_isize > cur_isize) {
>>                               i_size_write(inode, new_isize);
>>                               invalid |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_DATA;
>>                       }

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2011-06-14 16:22 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-12 23:43 [PATCH 00/34] pnfs block layout driver based on v3.0-rc2 Jim Rees
2011-06-12 23:43 ` [PATCH 01/34] pnfs: GETDEVICELIST Jim Rees
2011-06-12 23:43 ` [PATCH 02/34] pnfs: add set-clear layoutdriver interface Jim Rees
2011-06-12 23:43 ` [PATCH 03/34] pnfs: let layoutcommit code handle multiple segments Jim Rees
2011-06-13 14:36   ` Fred Isaman
2011-06-14 10:40     ` tao.peng
2011-06-14 13:58       ` Fred Isaman
2011-06-14 14:28       ` Benny Halevy
2011-06-12 23:43 ` [PATCH 04/34] pnfs: hook nfs_write_begin/end to allow layout driver manipulation Jim Rees
2011-06-13 14:44   ` Fred Isaman
2011-06-14 11:01     ` tao.peng
2011-06-14 14:05       ` Fred Isaman
2011-06-14 15:53         ` Peng Tao
2011-06-14 16:02           ` Fred Isaman
2011-06-12 23:43 ` [PATCH 05/34] pnfs: ask for layout_blksize and save it in nfs_server Jim Rees
2011-06-14 15:01   ` Benny Halevy
2011-06-14 15:08     ` Peng Tao
2011-06-12 23:44 ` [PATCH 06/34] pnfs: cleanup_layoutcommit Jim Rees
2011-06-13 21:19   ` Benny Halevy
2011-06-14 15:16     ` Peng Tao
2011-06-14 15:10   ` Benny Halevy
2011-06-14 15:21     ` Peng Tao
2011-06-14 15:19   ` Benny Halevy
2011-06-12 23:44 ` [PATCH 07/34] pnfsblock: define PNFS_BLOCK Kconfig option Jim Rees
2011-06-14 15:13   ` Benny Halevy
2011-06-12 23:44 ` [PATCH 08/34] pnfsblock: blocklayout stub Jim Rees
2011-06-12 23:44 ` [PATCH 09/34] pnfsblock: layout alloc and free Jim Rees
2011-06-12 23:44 ` [PATCH 10/34] Add support for simple rpc pipefs Jim Rees
2011-06-12 23:44 ` [PATCH 11/34] pnfs-block: Add block device discovery pipe Jim Rees
2011-06-12 23:44 ` [PATCH 12/34] pnfsblock: basic extent code Jim Rees
2011-06-12 23:44 ` [PATCH 13/34] pnfsblock: add device operations Jim Rees
2011-06-12 23:44 ` [PATCH 14/34] pnfsblock: remove " Jim Rees
2011-06-12 23:44 ` [PATCH 15/34] pnfsblock: lseg alloc and free Jim Rees
2011-06-12 23:44 ` [PATCH 16/34] pnfsblock: merge extents Jim Rees
2011-06-12 23:44 ` [PATCH 17/34] pnfsblock: call and parse getdevicelist Jim Rees
2011-06-14 15:36   ` Benny Halevy
2011-06-12 23:44 ` [PATCH 18/34] pnfsblock: allow use of PG_owner_priv_1 flag Jim Rees
2011-06-13 15:56   ` Fred Isaman
2011-06-12 23:44 ` [PATCH 19/34] pnfsblock: xdr decode pnfs_block_layout4 Jim Rees
2011-06-12 23:44 ` [PATCH 20/34] pnfsblock: find_get_extent Jim Rees
2011-06-12 23:44 ` [PATCH 21/34] pnfsblock: SPLITME: add extent manipulation functions Jim Rees
2011-06-14 15:40   ` Benny Halevy
2011-06-12 23:44 ` [PATCH 22/34] pnfsblock: merge rw extents Jim Rees
2011-06-12 23:44 ` [PATCH 23/34] pnfsblock: encode_layoutcommit Jim Rees
2011-06-14 15:44   ` Benny Halevy
2011-06-12 23:44 ` [PATCH 24/34] pnfsblock: cleanup_layoutcommit Jim Rees
2011-06-12 23:44 ` [PATCH 25/34] pnfsblock: bl_read_pagelist Jim Rees
2011-06-12 23:44 ` [PATCH 26/34] pnfsblock: write_begin Jim Rees
2011-06-12 23:44 ` [PATCH 27/34] pnfsblock: write_end Jim Rees
2011-06-12 23:44 ` [PATCH 28/34] pnfsblock: write_end_cleanup Jim Rees
2011-06-12 23:45 ` [PATCH 29/34] pnfsblock: bl_write_pagelist support functions Jim Rees
2011-06-12 23:45 ` [PATCH 30/34] pnfsblock: bl_write_pagelist Jim Rees
2011-06-12 23:45 ` [PATCH 31/34] pnfsblock: note written INVAL areas for layoutcommit Jim Rees
2011-06-12 23:45 ` [PATCH 32/34] pnfsblock: Implement release_inval_marks Jim Rees
2011-06-12 23:45 ` [PATCH 33/34] Add configurable prefetch size for layoutget Jim Rees
2011-06-12 23:45 ` [PATCH 34/34] NFS41: do not update isize if inode needs layoutcommit Jim Rees
2011-06-14 16:15   ` Benny Halevy
2011-06-14 16:22     ` Fred Isaman
