linux-nfs.vger.kernel.org archive mirror
* [PATCH v4 00/27] add block layout driver to pnfs client
@ 2011-07-28 17:30 Jim Rees
  2011-07-28 17:30 ` [PATCH v4 01/27] pnfs: GETDEVICELIST Jim Rees
                   ` (27 more replies)
  0 siblings, 28 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:30 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

This patch set adds a block layout driver to the pNFS client.  It passes
Connectathon tests and is bisectable.  It requires an updated version of
nfs-utils, and patches for that have been sent separately to the nfs-utils
maintainer.

This patch set is also available on the for-trond branch of the git repo at
git://citi.umich.edu/projects/linux-pnfs-blk.git

This is version 4 of this patch set.  Changes since version 3:

- Rebase to trond/nfs-for-3.1 (commit ed1e6211a0).
- Change return value NFS4ERR_REP_TOO_BIG to EIO in decode_getdevicelist.
- fix error message in set_pnfs_layoutdriver.
- remove dont_like_caller from read_pagelist and write_pagelist.
- rename mark_for_commit to bl_mark_for_commit
- split the patch "NFS41: Let layoutcommit handle multiple segments" into 4
  patches, each of which does only one task.
- fix race in saving cred in layout header.
- move the patch "NFS41: save layoutcommit cred after first successful
  layoutget" to an earlier place.
- "[PATCH v3 25/25] NFS41: Drop lseg ref before fallthru to MDS" is dropped.   
- move GETDEVICELIST to end of nfs4_procedures[]

Changes since version 2:

- Rebase to trond/nfs-for-next (commit ed1e6211a0).
- Fix Fred's and Benny's email addresses.
- Use FIXME to flag code that needs more work.
- Remove obsolete comments.
- Minor patch re-orgs per reviewer comments.

Changes since version 1:

NFS41: Drop lseg ref before fallthru to MDS
SQUASHME: pnfsblock: get rid of vmap and deviceid->area structure
SQUASHME: pnfsblock: define module alias
SQUASHME: pnfsblock: bl_find_get_extent optimization: mv break clause to end of loop
SQUASHME: pnfsblock: test debug bit once for multiple dprintks
SQUASHME: pnfsblock: typo
SQUASHME: pnfsblock: get rid of unused leftovers from device mapping removal

Andy Adamson (2):
  pnfs: GETDEVICELIST
  pnfs: cleanup_layoutcommit

Benny Halevy (2):
  pnfs: add set-clear layoutdriver interface
  pnfsblock: use pageio_ops api

Fred Isaman (15):
  pnfs: ask for layout_blksize and save it in nfs_server
  pnfsblock: add blocklayout Kconfig option, Makefile, and stubs
  pnfsblock: basic extent code
  pnfsblock: lseg alloc and free
  pnfsblock: merge extents
  pnfsblock: call and parse getdevicelist
  pnfsblock: xdr decode pnfs_block_layout4
  pnfsblock: bl_find_get_extent
  pnfsblock: add extent manipulation functions
  pnfsblock: merge rw extents
  pnfsblock: encode_layoutcommit
  pnfsblock: cleanup_layoutcommit
  pnfsblock: bl_read_pagelist
  pnfsblock: bl_write_pagelist
  pnfsblock: note written INVAL areas for layoutcommit

Jim Rees (2):
  pnfsblock: add device operations
  pnfsblock: remove device operations

Peng Tao (6):
  pnfs: save layoutcommit lwb at layout header
  pnfs: save layoutcommit cred at layout header
  pnfs: let layoutcommit handle a list of lseg
  pnfs: use lwb as layoutcommit length
  NFS41: save layoutcommit cred in layout header init
  pnfsblock: write_pagelist handle zero invalid extents

 fs/nfs/Kconfig                      |    8 +-
 fs/nfs/Makefile                     |    1 +
 fs/nfs/blocklayout/Makefile         |    5 +
 fs/nfs/blocklayout/blocklayout.c    | 1018 +++++++++++++++++++++++++++++++++++
 fs/nfs/blocklayout/blocklayout.h    |  208 +++++++
 fs/nfs/blocklayout/blocklayoutdev.c |  410 ++++++++++++++
 fs/nfs/blocklayout/blocklayoutdm.c  |  111 ++++
 fs/nfs/blocklayout/extents.c        |  943 ++++++++++++++++++++++++++++++++
 fs/nfs/client.c                     |   11 +-
 fs/nfs/nfs4_fs.h                    |    2 +-
 fs/nfs/nfs4filelayout.c             |    2 +-
 fs/nfs/nfs4proc.c                   |   62 ++-
 fs/nfs/nfs4xdr.c                    |  233 ++++++++-
 fs/nfs/pnfs.c                       |   88 ++--
 fs/nfs/pnfs.h                       |   30 +-
 include/linux/nfs.h                 |    2 +
 include/linux/nfs4.h                |    1 +
 include/linux/nfs_fs_sb.h           |    4 +-
 include/linux/nfs_xdr.h             |   17 +-
 19 files changed, 3089 insertions(+), 67 deletions(-)
 create mode 100644 fs/nfs/blocklayout/Makefile
 create mode 100644 fs/nfs/blocklayout/blocklayout.c
 create mode 100644 fs/nfs/blocklayout/blocklayout.h
 create mode 100644 fs/nfs/blocklayout/blocklayoutdev.c
 create mode 100644 fs/nfs/blocklayout/blocklayoutdm.c
 create mode 100644 fs/nfs/blocklayout/extents.c

-- 
1.7.4.1



* [PATCH v4 01/27] pnfs: GETDEVICELIST
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
@ 2011-07-28 17:30 ` Jim Rees
  2011-07-28 17:30 ` [PATCH v4 02/27] pnfs: add set-clear layoutdriver interface Jim Rees
                   ` (26 subsequent siblings)
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:30 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Andy Adamson <andros@netapp.com>

The block layout driver uses GETDEVICELIST, so add client support for it.

Signed-off-by: Andy Adamson <andros@netapp.com>
[pass struct nfs_server * to getdevicelist]
[get machine creds for getdevicelist]
[fix getdevicelist decode sizing]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
---
 fs/nfs/nfs4proc.c       |   48 +++++++++++++++++
 fs/nfs/nfs4xdr.c        |  131 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/nfs/pnfs.h           |   12 ++++
 include/linux/nfs4.h    |    1 +
 include/linux/nfs_xdr.h |   11 ++++
 5 files changed, 203 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 079614d..ebb6f1a 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -5834,6 +5834,54 @@ int nfs4_proc_layoutreturn(struct nfs4_layoutreturn *lrp)
 	return status;
 }
 
+/*
+ * Retrieve the list of Data Server devices from the MDS.
+ */
+static int _nfs4_getdevicelist(struct nfs_server *server,
+				    const struct nfs_fh *fh,
+				    struct pnfs_devicelist *devlist)
+{
+	struct nfs4_getdevicelist_args args = {
+		.fh = fh,
+		.layoutclass = server->pnfs_curr_ld->id,
+	};
+	struct nfs4_getdevicelist_res res = {
+		.devlist = devlist,
+	};
+	struct rpc_message msg = {
+		.rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_GETDEVICELIST],
+		.rpc_argp = &args,
+		.rpc_resp = &res,
+	};
+	int status;
+
+	dprintk("--> %s\n", __func__);
+	status = nfs4_call_sync(server->client, server, &msg, &args.seq_args,
+				&res.seq_res, 0);
+	dprintk("<-- %s status=%d\n", __func__, status);
+	return status;
+}
+
+int nfs4_proc_getdevicelist(struct nfs_server *server,
+			    const struct nfs_fh *fh,
+			    struct pnfs_devicelist *devlist)
+{
+	struct nfs4_exception exception = { };
+	int err;
+
+	do {
+		err = nfs4_handle_exception(server,
+				_nfs4_getdevicelist(server, fh, devlist),
+				&exception);
+	} while (exception.retry);
+
+	dprintk("%s: err=%d, num_devs=%u\n", __func__,
+		err, devlist->num_devs);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(nfs4_proc_getdevicelist);
+
 static int
 _nfs4_proc_getdeviceinfo(struct nfs_server *server, struct pnfs_device *pdev)
 {
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index c191a9b..0a1a7b8 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -314,6 +314,17 @@ static int nfs4_stat_to_errno(int);
 				XDR_QUADLEN(NFS4_MAX_SESSIONID_LEN) + 5)
 #define encode_reclaim_complete_maxsz	(op_encode_hdr_maxsz + 4)
 #define decode_reclaim_complete_maxsz	(op_decode_hdr_maxsz + 4)
+#define encode_getdevicelist_maxsz (op_encode_hdr_maxsz + 4 + \
+				encode_verifier_maxsz)
+#define decode_getdevicelist_maxsz (op_decode_hdr_maxsz + \
+				2 /* nfs_cookie4 gdlr_cookie */ + \
+				decode_verifier_maxsz \
+				  /* verifier4 gdlr_verifier */ + \
+				1 /* gdlr_deviceid_list count */ + \
+				XDR_QUADLEN(NFS4_PNFS_GETDEVLIST_MAXNUM * \
+					    NFS4_DEVICEID4_SIZE) \
+				  /* gdlr_deviceid_list */ + \
+				1 /* bool gdlr_eof */)
 #define encode_getdeviceinfo_maxsz (op_encode_hdr_maxsz + 4 + \
 				XDR_QUADLEN(NFS4_DEVICEID4_SIZE))
 #define decode_getdeviceinfo_maxsz (op_decode_hdr_maxsz + \
@@ -748,6 +759,14 @@ static int nfs4_stat_to_errno(int);
 #define NFS4_dec_reclaim_complete_sz	(compound_decode_hdr_maxsz + \
 					 decode_sequence_maxsz + \
 					 decode_reclaim_complete_maxsz)
+#define NFS4_enc_getdevicelist_sz (compound_encode_hdr_maxsz + \
+				encode_sequence_maxsz + \
+				encode_putfh_maxsz + \
+				encode_getdevicelist_maxsz)
+#define NFS4_dec_getdevicelist_sz (compound_decode_hdr_maxsz + \
+				decode_sequence_maxsz + \
+				decode_putfh_maxsz + \
+				decode_getdevicelist_maxsz)
 #define NFS4_enc_getdeviceinfo_sz (compound_encode_hdr_maxsz +    \
 				encode_sequence_maxsz +\
 				encode_getdeviceinfo_maxsz)
@@ -1855,6 +1874,26 @@ static void encode_sequence(struct xdr_stream *xdr,
 
 #ifdef CONFIG_NFS_V4_1
 static void
+encode_getdevicelist(struct xdr_stream *xdr,
+		     const struct nfs4_getdevicelist_args *args,
+		     struct compound_hdr *hdr)
+{
+	__be32 *p;
+	nfs4_verifier dummy = {
+		.data = "dummmmmy",
+	};
+
+	p = reserve_space(xdr, 20);
+	*p++ = cpu_to_be32(OP_GETDEVICELIST);
+	*p++ = cpu_to_be32(args->layoutclass);
+	*p++ = cpu_to_be32(NFS4_PNFS_GETDEVLIST_MAXNUM);
+	xdr_encode_hyper(p, 0ULL);                          /* cookie */
+	encode_nfs4_verifier(xdr, &dummy);
+	hdr->nops++;
+	hdr->replen += decode_getdevicelist_maxsz;
+}
+
+static void
 encode_getdeviceinfo(struct xdr_stream *xdr,
 		     const struct nfs4_getdeviceinfo_args *args,
 		     struct compound_hdr *hdr)
@@ -2775,6 +2814,24 @@ static void nfs4_xdr_enc_reclaim_complete(struct rpc_rqst *req,
 }
 
 /*
+ * Encode GETDEVICELIST request
+ */
+static void nfs4_xdr_enc_getdevicelist(struct rpc_rqst *req,
+				       struct xdr_stream *xdr,
+				       struct nfs4_getdevicelist_args *args)
+{
+	struct compound_hdr hdr = {
+		.minorversion = nfs4_xdr_minorversion(&args->seq_args),
+	};
+
+	encode_compound_hdr(xdr, req, &hdr);
+	encode_sequence(xdr, &args->seq_args, &hdr);
+	encode_putfh(xdr, args->fh, &hdr);
+	encode_getdevicelist(xdr, args, &hdr);
+	encode_nops(&hdr);
+}
+
+/*
  * Encode GETDEVICEINFO request
  */
 static void nfs4_xdr_enc_getdeviceinfo(struct rpc_rqst *req,
@@ -5268,6 +5325,53 @@ out_overflow:
 }
 
 #if defined(CONFIG_NFS_V4_1)
+/*
+ * TODO: Need to handle case when EOF != true;
+ */
+static int decode_getdevicelist(struct xdr_stream *xdr,
+				struct pnfs_devicelist *res)
+{
+	__be32 *p;
+	int status, i;
+	struct nfs_writeverf verftemp;
+
+	status = decode_op_hdr(xdr, OP_GETDEVICELIST);
+	if (status)
+		return status;
+
+	p = xdr_inline_decode(xdr, 8 + 8 + 4);
+	if (unlikely(!p))
+		goto out_overflow;
+
+	/* TODO: Skip cookie for now */
+	p += 2;
+
+	/* Read verifier */
+	p = xdr_decode_opaque_fixed(p, verftemp.verifier, 8);
+
+	res->num_devs = be32_to_cpup(p);
+
+	dprintk("%s: num_dev %u\n", __func__, res->num_devs);
+
+	if (res->num_devs > NFS4_PNFS_GETDEVLIST_MAXNUM) {
+		printk(KERN_ERR "%s too many result dev_num %u\n",
+				__func__, res->num_devs);
+		return -EIO;
+	}
+
+	p = xdr_inline_decode(xdr,
+			      res->num_devs * NFS4_DEVICEID4_SIZE + 4);
+	if (unlikely(!p))
+		goto out_overflow;
+	for (i = 0; i < res->num_devs; i++)
+		p = xdr_decode_opaque_fixed(p, res->dev_id[i].data,
+					    NFS4_DEVICEID4_SIZE);
+	res->eof = be32_to_cpup(p);
+	return 0;
+out_overflow:
+	print_overflow_msg(__func__, xdr);
+	return -EIO;
+}
 
 static int decode_getdeviceinfo(struct xdr_stream *xdr,
 				struct pnfs_device *pdev)
@@ -6542,6 +6646,32 @@ static int nfs4_xdr_dec_reclaim_complete(struct rpc_rqst *rqstp,
 }
 
 /*
+ * Decode GETDEVICELIST response
+ */
+static int nfs4_xdr_dec_getdevicelist(struct rpc_rqst *rqstp,
+				      struct xdr_stream *xdr,
+				      struct nfs4_getdevicelist_res *res)
+{
+	struct compound_hdr hdr;
+	int status;
+
+	dprintk("decoding getdevicelist!\n");
+
+	status = decode_compound_hdr(xdr, &hdr);
+	if (status != 0)
+		goto out;
+	status = decode_sequence(xdr, &res->seq_res, rqstp);
+	if (status != 0)
+		goto out;
+	status = decode_putfh(xdr);
+	if (status != 0)
+		goto out;
+	status = decode_getdevicelist(xdr, res->devlist);
+out:
+	return status;
+}
+
+/*
  * Decode GETDEVINFO response
  */
 static int nfs4_xdr_dec_getdeviceinfo(struct rpc_rqst *rqstp,
@@ -6908,6 +7038,7 @@ struct rpc_procinfo	nfs4_procedures[] = {
 	PROC(SECINFO_NO_NAME,	enc_secinfo_no_name,	dec_secinfo_no_name),
 	PROC(TEST_STATEID,	enc_test_stateid,	dec_test_stateid),
 	PROC(FREE_STATEID,	enc_free_stateid,	dec_free_stateid),
+	PROC(GETDEVICELIST,	enc_getdevicelist,	dec_getdevicelist),
 #endif /* CONFIG_NFS_V4_1 */
 };
 
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 078670d..ffea314 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -133,14 +133,26 @@ struct pnfs_device {
 	unsigned int  layout_type;
 	unsigned int  mincount;
 	struct page **pages;
+	void          *area;
 	unsigned int  pgbase;
 	unsigned int  pglen;
 };
 
+#define NFS4_PNFS_GETDEVLIST_MAXNUM 16
+
+struct pnfs_devicelist {
+	unsigned int		eof;
+	unsigned int		num_devs;
+	struct nfs4_deviceid	dev_id[NFS4_PNFS_GETDEVLIST_MAXNUM];
+};
+
 extern int pnfs_register_layoutdriver(struct pnfs_layoutdriver_type *);
 extern void pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *);
 
 /* nfs4proc.c */
+extern int nfs4_proc_getdevicelist(struct nfs_server *server,
+				   const struct nfs_fh *fh,
+				   struct pnfs_devicelist *devlist);
 extern int nfs4_proc_getdeviceinfo(struct nfs_server *server,
 				   struct pnfs_device *dev);
 extern int nfs4_proc_layoutget(struct nfs4_layoutget *lgp);
diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
index a3c4bc8..76f99e8 100644
--- a/include/linux/nfs4.h
+++ b/include/linux/nfs4.h
@@ -566,6 +566,7 @@ enum {
 	NFSPROC4_CLNT_SECINFO_NO_NAME,
 	NFSPROC4_CLNT_TEST_STATEID,
 	NFSPROC4_CLNT_FREE_STATEID,
+	NFSPROC4_CLNT_GETDEVICELIST,
 };
 
 /* nfs41 types */
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 5b11595..a07b682 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -235,6 +235,17 @@ struct nfs4_layoutget {
 	gfp_t gfp_flags;
 };
 
+struct nfs4_getdevicelist_args {
+	const struct nfs_fh *fh;
+	u32 layoutclass;
+	struct nfs4_sequence_args seq_args;
+};
+
+struct nfs4_getdevicelist_res {
+	struct pnfs_devicelist *devlist;
+	struct nfs4_sequence_res seq_res;
+};
+
 struct nfs4_getdeviceinfo_args {
 	struct pnfs_device *pdev;
 	struct nfs4_sequence_args seq_args;
-- 
1.7.4.1
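
For readers following the decode path in this patch, here is a minimal userspace sketch of the same reply layout: an 8-byte cookie, an 8-byte verifier, a 4-byte device count, the fixed-size device IDs, and a 4-byte eof flag. It is not the kernel code. The names `decode_devlist`, `build_sample_reply`, `DEVID_SIZE` and `DEVLIST_MAX` are hypothetical, and the 16-byte device ID size is assumed to match NFS4_DEVICEID4_SIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEVID_SIZE  16	/* assumed to match NFS4_DEVICEID4_SIZE */
#define DEVLIST_MAX 16	/* mirrors NFS4_PNFS_GETDEVLIST_MAXNUM in the patch */

struct devlist {
	uint32_t eof;
	uint32_t num_devs;
	uint8_t  dev_id[DEVLIST_MAX][DEVID_SIZE];
};

/* Read one big-endian 32-bit XDR word. */
static uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Write one big-endian 32-bit XDR word. */
static void put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/*
 * Decode cookie (8) + verifier (8) + count (4), then count device IDs
 * and the eof word. Returns 0 on success, -1 if the buffer is too
 * short or the server reports more devices than we can hold.
 */
static int decode_devlist(const uint8_t *buf, size_t len, struct devlist *res)
{
	size_t off = 0;
	uint32_t i;

	if (len < 8 + 8 + 4)
		return -1;
	off += 8;	/* cookie is skipped, as in the patch */
	off += 8;	/* verifier is skipped in this sketch */
	res->num_devs = get_be32(buf + off);
	off += 4;
	if (res->num_devs > DEVLIST_MAX)
		return -1;
	if (len - off < (size_t)res->num_devs * DEVID_SIZE + 4)
		return -1;
	for (i = 0; i < res->num_devs; i++) {
		memcpy(res->dev_id[i], buf + off, DEVID_SIZE);
		off += DEVID_SIZE;
	}
	res->eof = get_be32(buf + off);
	return 0;
}

/* Build a sample reply body with 'count' devices; returns its length. */
static size_t build_sample_reply(uint8_t *buf, uint32_t count)
{
	size_t off = 16;
	uint32_t i;

	memset(buf, 0, 16);	/* zero cookie and verifier */
	put_be32(buf + off, count);
	off += 4;
	for (i = 0; i < count; i++) {
		memset(buf + off, (int)(i + 1), DEVID_SIZE);
		off += DEVID_SIZE;
	}
	put_be32(buf + off, 1);	/* eof = TRUE */
	return off + 4;
}
```

The count is checked against the array bound before any device ID is copied, which is the same ordering the patch uses to avoid overrunning `dev_id[]`.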



* [PATCH v4 02/27] pnfs: add set-clear layoutdriver interface
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
  2011-07-28 17:30 ` [PATCH v4 01/27] pnfs: GETDEVICELIST Jim Rees
@ 2011-07-28 17:30 ` Jim Rees
  2011-07-28 17:30 ` [PATCH v4 03/27] pnfs: save layoutcommit lwb at layout header Jim Rees
                   ` (25 subsequent siblings)
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:30 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Benny Halevy <bhalevy@panasas.com>

Add optional set_layoutdriver and clear_layoutdriver hooks so that a layout
driver can issue GETDEVICELIST at mount time and clean up at umount time.

[fixup non NFS_V4_1 set_pnfs_layoutdriver definition]
[pnfs: pass mntfh down the init_pnfs path]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
---
 fs/nfs/client.c |    8 +++++---
 fs/nfs/pnfs.c   |   15 +++++++++++++--
 fs/nfs/pnfs.h   |    8 ++++++--
 3 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 19ea7d9..a9b1848 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -904,7 +904,9 @@ error:
 /*
  * Load up the server record from information gained in an fsinfo record
  */
-static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fsinfo *fsinfo)
+static void nfs_server_set_fsinfo(struct nfs_server *server,
+				  struct nfs_fh *mntfh,
+				  struct nfs_fsinfo *fsinfo)
 {
 	unsigned long max_rpc_payload;
 
@@ -934,7 +936,7 @@ static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fsinfo *
 	if (server->wsize > NFS_MAX_FILE_IO_SIZE)
 		server->wsize = NFS_MAX_FILE_IO_SIZE;
 	server->wpages = (server->wsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	set_pnfs_layoutdriver(server, fsinfo->layouttype);
+	set_pnfs_layoutdriver(server, mntfh, fsinfo->layouttype);
 
 	server->wtmult = nfs_block_bits(fsinfo->wtmult, NULL);
 
@@ -980,7 +982,7 @@ static int nfs_probe_fsinfo(struct nfs_server *server, struct nfs_fh *mntfh, str
 	if (error < 0)
 		goto out_error;
 
-	nfs_server_set_fsinfo(server, &fsinfo);
+	nfs_server_set_fsinfo(server, mntfh, &fsinfo);
 
 	/* Get some general file system info */
 	if (server->namelen == 0) {
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 38e5508..037d310 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -76,8 +76,11 @@ find_pnfs_driver(u32 id)
 void
 unset_pnfs_layoutdriver(struct nfs_server *nfss)
 {
-	if (nfss->pnfs_curr_ld)
+	if (nfss->pnfs_curr_ld) {
+		if (nfss->pnfs_curr_ld->clear_layoutdriver)
+			nfss->pnfs_curr_ld->clear_layoutdriver(nfss);
 		module_put(nfss->pnfs_curr_ld->owner);
+	}
 	nfss->pnfs_curr_ld = NULL;
 }
 
@@ -88,7 +91,8 @@ unset_pnfs_layoutdriver(struct nfs_server *nfss)
  * @id layout type. Zero (illegal layout type) indicates pNFS not in use.
  */
 void
-set_pnfs_layoutdriver(struct nfs_server *server, u32 id)
+set_pnfs_layoutdriver(struct nfs_server *server, const struct nfs_fh *mntfh,
+		      u32 id)
 {
 	struct pnfs_layoutdriver_type *ld_type = NULL;
 
@@ -115,6 +119,13 @@ set_pnfs_layoutdriver(struct nfs_server *server, u32 id)
 		goto out_no_driver;
 	}
 	server->pnfs_curr_ld = ld_type;
+	if (ld_type->set_layoutdriver
+	    && ld_type->set_layoutdriver(server, mntfh)) {
+		printk(KERN_ERR "%s: Error initializing pNFS layout driver %u.\n",
+				__func__, id);
+		module_put(ld_type->owner);
+		goto out_no_driver;
+	}
 
 	dprintk("%s: pNFS module for %u set\n", __func__, id);
 	return;
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index ffea314..23d8267 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -80,6 +80,9 @@ struct pnfs_layoutdriver_type {
 	struct module *owner;
 	unsigned flags;
 
+	int (*set_layoutdriver) (struct nfs_server *, const struct nfs_fh *);
+	int (*clear_layoutdriver) (struct nfs_server *);
+
 	struct pnfs_layout_hdr * (*alloc_layout_hdr) (struct inode *inode, gfp_t gfp_flags);
 	void (*free_layout_hdr) (struct pnfs_layout_hdr *);
 
@@ -165,7 +168,7 @@ void put_lseg(struct pnfs_layout_segment *lseg);
 bool pnfs_pageio_init_read(struct nfs_pageio_descriptor *, struct inode *);
 bool pnfs_pageio_init_write(struct nfs_pageio_descriptor *, struct inode *, int);
 
-void set_pnfs_layoutdriver(struct nfs_server *, u32 id);
+void set_pnfs_layoutdriver(struct nfs_server *, const struct nfs_fh *, u32);
 void unset_pnfs_layoutdriver(struct nfs_server *);
 void pnfs_generic_pg_init_read(struct nfs_pageio_descriptor *, struct nfs_page *);
 int pnfs_generic_pg_readpages(struct nfs_pageio_descriptor *desc);
@@ -372,7 +375,8 @@ pnfs_roc_drain(struct inode *ino, u32 *barrier)
 	return false;
 }
 
-static inline void set_pnfs_layoutdriver(struct nfs_server *s, u32 id)
+static inline void set_pnfs_layoutdriver(struct nfs_server *s,
+					 const struct nfs_fh *mntfh, u32 id)
 {
 }
 
-- 
1.7.4.1
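
The optional-hook pattern this patch adds can be illustrated outside the kernel. The sketch below is a simplified userspace model, not the pNFS code: `struct layout_driver`, `struct server`, the plain integer `refcount` standing in for the module reference, and the `demo_*` hooks are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct server;

struct layout_driver {
	int refcount;	/* stand-in for the module reference */
	int (*set_layoutdriver)(struct server *);	/* optional */
	int (*clear_layoutdriver)(struct server *);	/* optional */
};

struct server {
	struct layout_driver *curr_ld;
	int initialized;
};

/* Attach a driver; undo everything if its optional set hook fails. */
static int server_set_driver(struct server *srv, struct layout_driver *ld)
{
	ld->refcount++;
	srv->curr_ld = ld;
	if (ld->set_layoutdriver && ld->set_layoutdriver(srv)) {
		ld->refcount--;
		srv->curr_ld = NULL;
		return -1;
	}
	return 0;
}

/* Detach the driver, calling its optional clear hook first. */
static void server_unset_driver(struct server *srv)
{
	if (srv->curr_ld) {
		if (srv->curr_ld->clear_layoutdriver)
			srv->curr_ld->clear_layoutdriver(srv);
		srv->curr_ld->refcount--;
	}
	srv->curr_ld = NULL;
}

/* Example hooks used only to exercise the sketch. */
static int demo_set_ok(struct server *srv)
{
	srv->initialized = 1;
	return 0;
}

static int demo_set_fail(struct server *srv)
{
	(void)srv;
	return -1;
}

static int demo_clear(struct server *srv)
{
	srv->initialized = 0;
	return 0;
}
```

Both hooks are tested for NULL before the call, so a driver that needs no mount-time setup can leave them unset.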



* [PATCH v4 03/27] pnfs: save layoutcommit lwb at layout header
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
  2011-07-28 17:30 ` [PATCH v4 01/27] pnfs: GETDEVICELIST Jim Rees
  2011-07-28 17:30 ` [PATCH v4 02/27] pnfs: add set-clear layoutdriver interface Jim Rees
@ 2011-07-28 17:30 ` Jim Rees
  2011-07-28 17:30 ` [PATCH v4 04/27] pnfs: save layoutcommit cred " Jim Rees
                   ` (24 subsequent siblings)
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:30 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Peng Tao <bergwolf@gmail.com>

The last write byte only needs to be saved once in the layout header, not
for every lseg.

Signed-off-by: Peng Tao <peng_tao@emc.com>
---
 fs/nfs/nfs4filelayout.c |    2 +-
 fs/nfs/pnfs.c           |   10 ++++++----
 fs/nfs/pnfs.h           |    2 +-
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
index be93a62..e8915d4 100644
--- a/fs/nfs/nfs4filelayout.c
+++ b/fs/nfs/nfs4filelayout.c
@@ -170,7 +170,7 @@ filelayout_set_layoutcommit(struct nfs_write_data *wdata)
 
 	pnfs_set_layoutcommit(wdata);
 	dprintk("%s ionde %lu pls_end_pos %lu\n", __func__, wdata->inode->i_ino,
-		(unsigned long) wdata->lseg->pls_end_pos);
+		(unsigned long) NFS_I(wdata->inode)->layout->plh_lwb);
 }
 
 /*
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 037d310..3e0989d 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1390,9 +1390,11 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
 		dprintk("%s: Set layoutcommit for inode %lu ",
 			__func__, wdata->inode->i_ino);
 	}
-	if (end_pos > wdata->lseg->pls_end_pos)
-		wdata->lseg->pls_end_pos = end_pos;
+	if (end_pos > nfsi->layout->plh_lwb)
+		nfsi->layout->plh_lwb = end_pos;
 	spin_unlock(&nfsi->vfs_inode.i_lock);
+	dprintk("%s: lseg %p end_pos %llu\n",
+		__func__, wdata->lseg, nfsi->layout->plh_lwb);
 
 	/* if pnfs_layoutcommit_inode() runs between inode locks, the next one
 	 * will be a noop because NFS_INO_LAYOUTCOMMIT will not be set */
@@ -1444,9 +1446,9 @@ pnfs_layoutcommit_inode(struct inode *inode, bool sync)
 	 */
 	lseg = pnfs_list_write_lseg(inode);
 
-	end_pos = lseg->pls_end_pos;
+	end_pos = nfsi->layout->plh_lwb;
 	cred = lseg->pls_lc_cred;
-	lseg->pls_end_pos = 0;
+	nfsi->layout->plh_lwb = 0;
 	lseg->pls_lc_cred = NULL;
 
 	memcpy(&data->args.stateid.data, nfsi->layout->plh_stateid.data,
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 23d8267..044cb3e 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -45,7 +45,6 @@ struct pnfs_layout_segment {
 	unsigned long pls_flags;
 	struct pnfs_layout_hdr *pls_layout;
 	struct rpc_cred	*pls_lc_cred; /* LAYOUTCOMMIT credential */
-	loff_t pls_end_pos; /* LAYOUTCOMMIT write end */
 };
 
 enum pnfs_try_status {
@@ -128,6 +127,7 @@ struct pnfs_layout_hdr {
 	unsigned long		plh_block_lgets; /* block LAYOUTGET if >0 */
 	u32			plh_barrier; /* ignore lower seqids */
 	unsigned long		plh_flags;
+	loff_t			plh_lwb; /* last write byte for layoutcommit */
 	struct inode		*plh_inode;
 };
 
-- 
1.7.4.1



* [PATCH v4 04/27] pnfs: save layoutcommit cred at layout header
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (2 preceding siblings ...)
  2011-07-28 17:30 ` [PATCH v4 03/27] pnfs: save layoutcommit lwb at layout header Jim Rees
@ 2011-07-28 17:30 ` Jim Rees
  2011-07-28 17:30 ` [PATCH v4 05/27] pnfs: let layoutcommit handle a list of lseg Jim Rees
                   ` (23 subsequent siblings)
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:30 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Peng Tao <bergwolf@gmail.com>

The layoutcommit credential only needs to be saved once in the layout header,
not for every lseg.

Signed-off-by: Peng Tao <peng_tao@emc.com>
---
 fs/nfs/pnfs.c |    6 +++---
 fs/nfs/pnfs.h |    2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 3e0989d..201165e 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1384,7 +1384,7 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
 	if (!test_and_set_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) {
 		/* references matched in nfs4_layoutcommit_release */
 		get_lseg(wdata->lseg);
-		wdata->lseg->pls_lc_cred =
+		nfsi->layout->plh_lc_cred =
 			get_rpccred(wdata->args.context->state->owner->so_cred);
 		mark_as_dirty = true;
 		dprintk("%s: Set layoutcommit for inode %lu ",
@@ -1447,9 +1447,9 @@ pnfs_layoutcommit_inode(struct inode *inode, bool sync)
 	lseg = pnfs_list_write_lseg(inode);
 
 	end_pos = nfsi->layout->plh_lwb;
-	cred = lseg->pls_lc_cred;
+	cred = nfsi->layout->plh_lc_cred;
 	nfsi->layout->plh_lwb = 0;
-	lseg->pls_lc_cred = NULL;
+	nfsi->layout->plh_lc_cred = NULL;
 
 	memcpy(&data->args.stateid.data, nfsi->layout->plh_stateid.data,
 		sizeof(nfsi->layout->plh_stateid.data));
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 044cb3e..ac86c36 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -44,7 +44,6 @@ struct pnfs_layout_segment {
 	atomic_t pls_refcount;
 	unsigned long pls_flags;
 	struct pnfs_layout_hdr *pls_layout;
-	struct rpc_cred	*pls_lc_cred; /* LAYOUTCOMMIT credential */
 };
 
 enum pnfs_try_status {
@@ -128,6 +127,7 @@ struct pnfs_layout_hdr {
 	u32			plh_barrier; /* ignore lower seqids */
 	unsigned long		plh_flags;
 	loff_t			plh_lwb; /* last write byte for layoutcommit */
+	struct rpc_cred		*plh_lc_cred; /* layoutcommit cred */
 	struct inode		*plh_inode;
 };
 
-- 
1.7.4.1



* [PATCH v4 05/27] pnfs: let layoutcommit handle a list of lseg
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (3 preceding siblings ...)
  2011-07-28 17:30 ` [PATCH v4 04/27] pnfs: save layoutcommit cred " Jim Rees
@ 2011-07-28 17:30 ` Jim Rees
  2011-07-28 18:52   ` Boaz Harrosh
  2011-07-28 17:30 ` [PATCH v4 06/27] pnfs: use lwb as layoutcommit length Jim Rees
                   ` (22 subsequent siblings)
  27 siblings, 1 reply; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:30 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Peng Tao <bergwolf@gmail.com>

There can be multiple lsegs per file, so layoutcommit should be
able to handle them.

Signed-off-by: Peng Tao <peng_tao@emc.com>
---
 fs/nfs/nfs4proc.c       |    8 +++++++-
 fs/nfs/pnfs.c           |   34 +++++++++++++++++-----------------
 fs/nfs/pnfs.h           |    2 ++
 include/linux/nfs_xdr.h |    2 +-
 4 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index ebb6f1a..af32d3d 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -5960,9 +5960,15 @@ nfs4_layoutcommit_done(struct rpc_task *task, void *calldata)
 static void nfs4_layoutcommit_release(void *calldata)
 {
 	struct nfs4_layoutcommit_data *data = calldata;
+	struct pnfs_layout_segment *lseg, *tmp;
 
 	/* Matched by references in pnfs_set_layoutcommit */
-	put_lseg(data->lseg);
+	list_for_each_entry_safe(lseg, tmp, &data->lseg_list, pls_lc_list) {
+		list_del_init(&lseg->pls_lc_list);
+		if (test_and_clear_bit(NFS_LSEG_LAYOUTCOMMIT,
+				       &lseg->pls_flags))
+			put_lseg(lseg);
+	}
 	put_rpccred(data->cred);
 	kfree(data);
 }
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 201165e..e2c1eb4 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -235,6 +235,7 @@ static void
 init_lseg(struct pnfs_layout_hdr *lo, struct pnfs_layout_segment *lseg)
 {
 	INIT_LIST_HEAD(&lseg->pls_list);
+	INIT_LIST_HEAD(&lseg->pls_lc_list);
 	atomic_set(&lseg->pls_refcount, 1);
 	smp_mb();
 	set_bit(NFS_LSEG_VALID, &lseg->pls_flags);
@@ -1361,16 +1362,17 @@ pnfs_generic_pg_readpages(struct nfs_pageio_descriptor *desc)
 EXPORT_SYMBOL_GPL(pnfs_generic_pg_readpages);
 
 /*
- * Currently there is only one (whole file) write lseg.
+ * There can be multiple RW segments.
  */
-static struct pnfs_layout_segment *pnfs_list_write_lseg(struct inode *inode)
+static void pnfs_list_write_lseg(struct inode *inode, struct list_head *listp)
 {
-	struct pnfs_layout_segment *lseg, *rv = NULL;
+	struct pnfs_layout_segment *lseg;
 
-	list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list)
-		if (lseg->pls_range.iomode == IOMODE_RW)
-			rv = lseg;
-	return rv;
+	list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list) {
+		if (lseg->pls_range.iomode == IOMODE_RW &&
+		    test_bit(NFS_LSEG_LAYOUTCOMMIT, &lseg->pls_flags))
+			list_add(&lseg->pls_lc_list, listp);
+	}
 }
 
 void
@@ -1382,14 +1384,16 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
 
 	spin_lock(&nfsi->vfs_inode.i_lock);
 	if (!test_and_set_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) {
-		/* references matched in nfs4_layoutcommit_release */
-		get_lseg(wdata->lseg);
+		mark_as_dirty = true;
 		nfsi->layout->plh_lc_cred =
 			get_rpccred(wdata->args.context->state->owner->so_cred);
-		mark_as_dirty = true;
 		dprintk("%s: Set layoutcommit for inode %lu ",
 			__func__, wdata->inode->i_ino);
 	}
+	if (!test_and_set_bit(NFS_LSEG_LAYOUTCOMMIT, &wdata->lseg->pls_flags)) {
+		/* references matched in nfs4_layoutcommit_release */
+		get_lseg(wdata->lseg);
+	}
 	if (end_pos > nfsi->layout->plh_lwb)
 		nfsi->layout->plh_lwb = end_pos;
 	spin_unlock(&nfsi->vfs_inode.i_lock);
@@ -1416,7 +1420,6 @@ pnfs_layoutcommit_inode(struct inode *inode, bool sync)
 {
 	struct nfs4_layoutcommit_data *data;
 	struct nfs_inode *nfsi = NFS_I(inode);
-	struct pnfs_layout_segment *lseg;
 	struct rpc_cred *cred;
 	loff_t end_pos;
 	int status = 0;
@@ -1434,17 +1437,15 @@ pnfs_layoutcommit_inode(struct inode *inode, bool sync)
 		goto out;
 	}
 
+	INIT_LIST_HEAD(&data->lseg_list);
 	spin_lock(&inode->i_lock);
 	if (!test_and_clear_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) {
 		spin_unlock(&inode->i_lock);
 		kfree(data);
 		goto out;
 	}
-	/*
-	 * Currently only one (whole file) write lseg which is referenced
-	 * in pnfs_set_layoutcommit and will be found.
-	 */
-	lseg = pnfs_list_write_lseg(inode);
+
+	pnfs_list_write_lseg(inode, &data->lseg_list);
 
 	end_pos = nfsi->layout->plh_lwb;
 	cred = nfsi->layout->plh_lc_cred;
@@ -1456,7 +1457,6 @@ pnfs_layoutcommit_inode(struct inode *inode, bool sync)
 	spin_unlock(&inode->i_lock);
 
 	data->args.inode = inode;
-	data->lseg = lseg;
 	data->cred = cred;
 	nfs_fattr_init(&data->fattr);
 	data->args.bitmask = NFS_SERVER(inode)->cache_consistency_bitmask;
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index ac86c36..bddd8b9 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -36,10 +36,12 @@
 enum {
 	NFS_LSEG_VALID = 0,	/* cleared when lseg is recalled/returned */
 	NFS_LSEG_ROC,		/* roc bit received from server */
+	NFS_LSEG_LAYOUTCOMMIT,	/* layoutcommit bit set for layoutcommit */
 };
 
 struct pnfs_layout_segment {
 	struct list_head pls_list;
+	struct list_head pls_lc_list;
 	struct pnfs_layout_range pls_range;
 	atomic_t pls_refcount;
 	unsigned long pls_flags;
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index a07b682..21f333e 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -273,7 +273,7 @@ struct nfs4_layoutcommit_res {
 struct nfs4_layoutcommit_data {
 	struct rpc_task task;
 	struct nfs_fattr fattr;
-	struct pnfs_layout_segment *lseg;
+	struct list_head lseg_list;
 	struct rpc_cred *cred;
 	struct nfs4_layoutcommit_args args;
 	struct nfs4_layoutcommit_res res;
-- 
1.7.4.1
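
The bookkeeping in this patch (flag a segment on first write, take one reference, gather flagged RW segments, then drop the references on release) can be modelled in a few lines. This is only a sketch under simplifying assumptions: it uses singly linked lists and a plain bool instead of the kernel's list_head and atomic bit operations, it does no locking, and the names `struct lseg`, `lseg_mark_for_commit`, `collect_commit_list` and `release_commit_list` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum iomode { IOMODE_READ = 1, IOMODE_RW = 2 };

struct lseg {
	enum iomode iomode;
	bool needs_commit;	/* models NFS_LSEG_LAYOUTCOMMIT */
	int refcount;
	struct lseg *next;	/* all segments of the layout */
	struct lseg *lc_next;	/* link on the commit list */
};

/* Take a reference only the first time a segment is flagged. */
static void lseg_mark_for_commit(struct lseg *seg)
{
	if (!seg->needs_commit) {
		seg->needs_commit = true;
		seg->refcount++;
	}
}

/* Gather every flagged RW segment onto a separate commit list. */
static struct lseg *collect_commit_list(struct lseg *all)
{
	struct lseg *head = NULL;
	struct lseg *s;

	for (s = all; s; s = s->next) {
		if (s->iomode == IOMODE_RW && s->needs_commit) {
			s->lc_next = head;
			head = s;
		}
	}
	return head;
}

/* Clear the flag and drop the matching reference; returns the count. */
static int release_commit_list(struct lseg *head)
{
	int released = 0;

	while (head) {
		struct lseg *next = head->lc_next;

		head->lc_next = NULL;
		if (head->needs_commit) {
			head->needs_commit = false;
			head->refcount--;
			released++;
		}
		head = next;
	}
	return released;
}
```

Pairing the reference with the flag means repeated writes to the same segment do not leak references, which is the property the test-and-set in the patch provides.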



* [PATCH v4 06/27] pnfs: use lwb as layoutcommit length
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (4 preceding siblings ...)
  2011-07-28 17:30 ` [PATCH v4 05/27] pnfs: let layoutcommit handle a list of lseg Jim Rees
@ 2011-07-28 17:30 ` Jim Rees
  2011-07-28 17:30 ` [PATCH v4 07/27] NFS41: save layoutcommit cred in layout header init Jim Rees
                   ` (21 subsequent siblings)
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:30 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Peng Tao <bergwolf@gmail.com>

Using NFS4_MAX_UINT64 as the layoutcommit length will break the current
protocol, so send the last written byte plus one instead.

Signed-off-by: Peng Tao <peng_tao@emc.com>
---
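Reviewer note, not part of the patch: a minimal userspace sketch of the
length calculation, assuming the range always starts at offset 0 as in the
hunk below.  The names lc_range and lc_range_from_lwb are hypothetical and
do not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the offset/length pair sent in LAYOUTCOMMIT. */
struct lc_range {
	uint64_t offset;
	uint64_t length;
};

/*
 * Build a range that starts at offset 0 and ends just past the last
 * written byte (lwb).  A range covering bytes 0..lwb holds lwb + 1 bytes.
 */
static struct lc_range lc_range_from_lwb(uint64_t lwb)
{
	struct lc_range r = { .offset = 0, .length = lwb + 1 };

	return r;
}
```

With lwb = 4095 (one full 4 KiB page written), the sketch yields a length
of 4096 rather than the all-ones value NFS4_MAX_UINT64.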
 fs/nfs/nfs4xdr.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index 0a1a7b8..5f769f8 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -1955,7 +1955,7 @@ encode_layoutcommit(struct xdr_stream *xdr,
 	*p++ = cpu_to_be32(OP_LAYOUTCOMMIT);
 	/* Only whole file layouts */
 	p = xdr_encode_hyper(p, 0); /* offset */
-	p = xdr_encode_hyper(p, NFS4_MAX_UINT64); /* length */
+	p = xdr_encode_hyper(p, args->lastbytewritten + 1);	/* length */
 	*p++ = cpu_to_be32(0); /* reclaim */
 	p = xdr_encode_opaque_fixed(p, args->stateid.data, NFS4_STATEID_SIZE);
 	*p++ = cpu_to_be32(1); /* newoffset = TRUE */
-- 
1.7.4.1



* [PATCH v4 07/27] NFS41: save layoutcommit cred in layout header init
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (5 preceding siblings ...)
  2011-07-28 17:30 ` [PATCH v4 06/27] pnfs: use lwb as layoutcommit length Jim Rees
@ 2011-07-28 17:30 ` Jim Rees
  2011-07-28 17:30 ` [PATCH v4 08/27] pnfs: ask for layout_blksize and save it in nfs_server Jim Rees
                   ` (20 subsequent siblings)
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:30 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Peng Tao <bergwolf@gmail.com>

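There is no need to save the cred at every pnfs_set_layoutcommit call.
Save it once when the layout header is allocated and release it when the
header is freed.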

Signed-off-by: Peng Tao <peng_tao@emc.com>
---
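Reviewer note, not part of the patch: a rough userspace sketch of the
reference lifetime this change appears to aim for.  The header pins one
credential reference from allocation to free, and each layoutcommit takes
its own reference on top of that.  All names here (struct cred, cred_get,
hdr_alloc, commit_begin, ...) are hypothetical stand-ins, not the kernel's
rpc_cred API, and the counter is only tracked, never used to free anything.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted credential, standing in for struct rpc_cred. */
struct cred {
	int refs;
};

static struct cred *cred_get(struct cred *c)
{
	c->refs++;
	return c;
}

static void cred_put(struct cred *c)
{
	c->refs--;
}

/* Hypothetical layout header that pins one credential for its lifetime. */
struct layout_hdr {
	struct cred *lc_cred;
};

/* Take the header's reference once, at allocation time. */
static struct layout_hdr *hdr_alloc(struct cred *owner_cred)
{
	struct layout_hdr *lo = calloc(1, sizeof(*lo));

	if (!lo)
		return NULL;
	lo->lc_cred = cred_get(owner_cred);
	return lo;
}

/* Drop the header's reference when the header goes away. */
static void hdr_free(struct layout_hdr *lo)
{
	cred_put(lo->lc_cred);
	free(lo);
}

/* Each commit holds its own reference for as long as it is in flight. */
static struct cred *commit_begin(struct layout_hdr *lo)
{
	return cred_get(lo->lc_cred);
}

static void commit_end(struct cred *c)
{
	cred_put(c);
}
```

In this sketch the header's pointer is never cleared between commits, so
the write path no longer has to set it again.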
 fs/nfs/pnfs.c |   21 +++++++++++----------
 1 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index e2c1eb4..3a47f7c 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -201,6 +201,7 @@ static void
 pnfs_free_layout_hdr(struct pnfs_layout_hdr *lo)
 {
 	struct pnfs_layoutdriver_type *ld = NFS_SERVER(lo->plh_inode)->pnfs_curr_ld;
+	put_rpccred(lo->plh_lc_cred);
 	return ld->alloc_layout_hdr ? ld->free_layout_hdr(lo) : kfree(lo);
 }
 
@@ -828,7 +829,9 @@ out:
 }
 
 static struct pnfs_layout_hdr *
-alloc_init_layout_hdr(struct inode *ino, gfp_t gfp_flags)
+alloc_init_layout_hdr(struct inode *ino,
+		      struct nfs_open_context *ctx,
+		      gfp_t gfp_flags)
 {
 	struct pnfs_layout_hdr *lo;
 
@@ -840,11 +843,14 @@ alloc_init_layout_hdr(struct inode *ino, gfp_t gfp_flags)
 	INIT_LIST_HEAD(&lo->plh_segs);
 	INIT_LIST_HEAD(&lo->plh_bulk_recall);
 	lo->plh_inode = ino;
+	lo->plh_lc_cred = get_rpccred(ctx->state->owner->so_cred);
 	return lo;
 }
 
 static struct pnfs_layout_hdr *
-pnfs_find_alloc_layout(struct inode *ino, gfp_t gfp_flags)
+pnfs_find_alloc_layout(struct inode *ino,
+		       struct nfs_open_context *ctx,
+		       gfp_t gfp_flags)
 {
 	struct nfs_inode *nfsi = NFS_I(ino);
 	struct pnfs_layout_hdr *new = NULL;
@@ -859,7 +865,7 @@ pnfs_find_alloc_layout(struct inode *ino, gfp_t gfp_flags)
 			return nfsi->layout;
 	}
 	spin_unlock(&ino->i_lock);
-	new = alloc_init_layout_hdr(ino, gfp_flags);
+	new = alloc_init_layout_hdr(ino, ctx, gfp_flags);
 	spin_lock(&ino->i_lock);
 
 	if (likely(nfsi->layout == NULL))	/* Won the race? */
@@ -952,7 +958,7 @@ pnfs_update_layout(struct inode *ino,
 	if (!pnfs_enabled_sb(NFS_SERVER(ino)))
 		return NULL;
 	spin_lock(&ino->i_lock);
-	lo = pnfs_find_alloc_layout(ino, gfp_flags);
+	lo = pnfs_find_alloc_layout(ino, ctx, gfp_flags);
 	if (lo == NULL) {
 		dprintk("%s ERROR: can't get pnfs_layout_hdr\n", __func__);
 		goto out_unlock;
@@ -1385,8 +1391,6 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
 	spin_lock(&nfsi->vfs_inode.i_lock);
 	if (!test_and_set_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) {
 		mark_as_dirty = true;
-		nfsi->layout->plh_lc_cred =
-			get_rpccred(wdata->args.context->state->owner->so_cred);
 		dprintk("%s: Set layoutcommit for inode %lu ",
 			__func__, wdata->inode->i_ino);
 	}
@@ -1420,7 +1424,6 @@ pnfs_layoutcommit_inode(struct inode *inode, bool sync)
 {
 	struct nfs4_layoutcommit_data *data;
 	struct nfs_inode *nfsi = NFS_I(inode);
-	struct rpc_cred *cred;
 	loff_t end_pos;
 	int status = 0;
 
@@ -1448,16 +1451,14 @@ pnfs_layoutcommit_inode(struct inode *inode, bool sync)
 	pnfs_list_write_lseg(inode, &data->lseg_list);
 
 	end_pos = nfsi->layout->plh_lwb;
-	cred = nfsi->layout->plh_lc_cred;
 	nfsi->layout->plh_lwb = 0;
-	nfsi->layout->plh_lc_cred = NULL;
 
 	memcpy(&data->args.stateid.data, nfsi->layout->plh_stateid.data,
 		sizeof(nfsi->layout->plh_stateid.data));
 	spin_unlock(&inode->i_lock);
 
 	data->args.inode = inode;
-	data->cred = cred;
+	data->cred = get_rpccred(nfsi->layout->plh_lc_cred);
 	nfs_fattr_init(&data->fattr);
 	data->args.bitmask = NFS_SERVER(inode)->cache_consistency_bitmask;
 	data->res.fattr = &data->fattr;
-- 
1.7.4.1



* [PATCH v4 08/27] pnfs: ask for layout_blksize and save it in nfs_server
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (6 preceding siblings ...)
  2011-07-28 17:30 ` [PATCH v4 07/27] NFS41: save layoutcommit cred in layout header init Jim Rees
@ 2011-07-28 17:30 ` Jim Rees
  2011-07-28 17:30 ` [PATCH v4 09/27] pnfs: cleanup_layoutcommit Jim Rees
                   ` (19 subsequent siblings)
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:30 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

The block layout driver needs the layout_blksize attribute to determine
its I/O size.

Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Tao Guo <glorioustao@gmail.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
---
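Reviewer note, not part of the patch: a small userspace sketch of the
three-word attribute bitmap handling, assuming the wire form is a 32-bit
word count followed by that many big-endian words.  Like
encode_getattr_three below, the encoder sends only as many words as are
needed; like decode_attr_bitmap, the decoder zeroes all three words first
and keeps at most three.  The helper names (put_be32, get_be32,
bitmap_encode, bitmap_decode) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value in big-endian (XDR) byte order. */
static void put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/* Load a 32-bit big-endian value. */
static uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Encode a bitmap as a word count plus words.  The count is chosen from
 * the highest non-zero word, so a request without word-2 attributes
 * looks the same on the wire as it did with a two-word bitmap.
 * Returns the number of bytes written; buf must hold at least 16 bytes.
 */
static size_t bitmap_encode(uint8_t *buf, const uint32_t bm[3])
{
	uint32_t nwords = bm[2] ? 3 : (bm[1] ? 2 : 1);
	uint32_t i;

	put_be32(buf, nwords);
	for (i = 0; i < nwords; i++)
		put_be32(buf + 4 + 4 * i, bm[i]);
	return 4 + 4 * (size_t)nwords;
}

/*
 * Decode a bitmap.  Words beyond the third are skipped, missing words
 * read as zero.  Returns the bytes consumed, or 0 if buf is too short.
 */
static size_t bitmap_decode(const uint8_t *buf, size_t len, uint32_t bm[3])
{
	uint32_t nwords, i;

	bm[0] = bm[1] = bm[2] = 0;
	if (len < 4)
		return 0;
	nwords = get_be32(buf);
	if (nwords > (len - 4) / 4)
		return 0;
	for (i = 0; i < nwords && i < 3; i++)
		bm[i] = get_be32(buf + 4 + 4 * i);
	return 4 + 4 * (size_t)nwords;
}
```

A bitmap with only word 2 set still has to carry words 0 and 1 as zeros,
which is why every local bitmap array that can receive such a reply grows
to three words in this patch.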
 fs/nfs/client.c           |    1 +
 fs/nfs/nfs4_fs.h          |    2 +-
 fs/nfs/nfs4proc.c         |    5 +-
 fs/nfs/nfs4xdr.c          |   99 +++++++++++++++++++++++++++++++++++++--------
 include/linux/nfs_fs_sb.h |    3 +-
 include/linux/nfs_xdr.h   |    3 +-
 6 files changed, 91 insertions(+), 22 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index a9b1848..de00a37 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -936,6 +936,7 @@ static void nfs_server_set_fsinfo(struct nfs_server *server,
 	if (server->wsize > NFS_MAX_FILE_IO_SIZE)
 		server->wsize = NFS_MAX_FILE_IO_SIZE;
 	server->wpages = (server->wsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	server->pnfs_blksize = fsinfo->blksize;
 	set_pnfs_layoutdriver(server, mntfh, fsinfo->layouttype);
 
 	server->wtmult = nfs_block_bits(fsinfo->wtmult, NULL);
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index 1909ee8..1ec1a85 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -318,7 +318,7 @@ extern const struct nfs4_minor_version_ops *nfs_v4_minor_ops[];
 extern const u32 nfs4_fattr_bitmap[2];
 extern const u32 nfs4_statfs_bitmap[2];
 extern const u32 nfs4_pathconf_bitmap[2];
-extern const u32 nfs4_fsinfo_bitmap[2];
+extern const u32 nfs4_fsinfo_bitmap[3];
 extern const u32 nfs4_fs_locations_bitmap[2];
 
 /* nfs4renewd.c */
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index af32d3d..e86de79 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -140,12 +140,13 @@ const u32 nfs4_pathconf_bitmap[2] = {
 	0
 };
 
-const u32 nfs4_fsinfo_bitmap[2] = { FATTR4_WORD0_MAXFILESIZE
+const u32 nfs4_fsinfo_bitmap[3] = { FATTR4_WORD0_MAXFILESIZE
 			| FATTR4_WORD0_MAXREAD
 			| FATTR4_WORD0_MAXWRITE
 			| FATTR4_WORD0_LEASE_TIME,
 			FATTR4_WORD1_TIME_DELTA
-			| FATTR4_WORD1_FS_LAYOUT_TYPES
+			| FATTR4_WORD1_FS_LAYOUT_TYPES,
+			FATTR4_WORD2_LAYOUT_BLKSIZE
 };
 
 const u32 nfs4_fs_locations_bitmap[2] = {
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index 5f769f8..0261669 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -113,7 +113,11 @@ static int nfs4_stat_to_errno(int);
 #define encode_restorefh_maxsz  (op_encode_hdr_maxsz)
 #define decode_restorefh_maxsz  (op_decode_hdr_maxsz)
 #define encode_fsinfo_maxsz	(encode_getattr_maxsz)
-#define decode_fsinfo_maxsz	(op_decode_hdr_maxsz + 15)
+/* The 5 accounts for the PNFS attributes, and assumes that at most three
+ * layout types will be returned.
+ */
+#define decode_fsinfo_maxsz	(op_decode_hdr_maxsz + \
+				 nfs4_fattr_bitmap_maxsz + 4 + 8 + 5)
 #define encode_renew_maxsz	(op_encode_hdr_maxsz + 3)
 #define decode_renew_maxsz	(op_decode_hdr_maxsz)
 #define encode_setclientid_maxsz \
@@ -1123,6 +1127,35 @@ static void encode_getattr_two(struct xdr_stream *xdr, uint32_t bm0, uint32_t bm
 	hdr->replen += decode_getattr_maxsz;
 }
 
+static void
+encode_getattr_three(struct xdr_stream *xdr,
+		     uint32_t bm0, uint32_t bm1, uint32_t bm2,
+		     struct compound_hdr *hdr)
+{
+	__be32 *p;
+
+	p = reserve_space(xdr, 4);
+	*p = cpu_to_be32(OP_GETATTR);
+	if (bm2) {
+		p = reserve_space(xdr, 16);
+		*p++ = cpu_to_be32(3);
+		*p++ = cpu_to_be32(bm0);
+		*p++ = cpu_to_be32(bm1);
+		*p = cpu_to_be32(bm2);
+	} else if (bm1) {
+		p = reserve_space(xdr, 12);
+		*p++ = cpu_to_be32(2);
+		*p++ = cpu_to_be32(bm0);
+		*p = cpu_to_be32(bm1);
+	} else {
+		p = reserve_space(xdr, 8);
+		*p++ = cpu_to_be32(1);
+		*p = cpu_to_be32(bm0);
+	}
+	hdr->nops++;
+	hdr->replen += decode_getattr_maxsz;
+}
+
 static void encode_getfattr(struct xdr_stream *xdr, const u32* bitmask, struct compound_hdr *hdr)
 {
 	encode_getattr_two(xdr, bitmask[0] & nfs4_fattr_bitmap[0],
@@ -1131,8 +1164,11 @@ static void encode_getfattr(struct xdr_stream *xdr, const u32* bitmask, struct c
 
 static void encode_fsinfo(struct xdr_stream *xdr, const u32* bitmask, struct compound_hdr *hdr)
 {
-	encode_getattr_two(xdr, bitmask[0] & nfs4_fsinfo_bitmap[0],
-			   bitmask[1] & nfs4_fsinfo_bitmap[1], hdr);
+	encode_getattr_three(xdr,
+			     bitmask[0] & nfs4_fsinfo_bitmap[0],
+			     bitmask[1] & nfs4_fsinfo_bitmap[1],
+			     bitmask[2] & nfs4_fsinfo_bitmap[2],
+			     hdr);
 }
 
 static void encode_fs_locations(struct xdr_stream *xdr, const u32* bitmask, struct compound_hdr *hdr)
@@ -2643,7 +2679,7 @@ static void nfs4_xdr_enc_setclientid_confirm(struct rpc_rqst *req,
 	struct compound_hdr hdr = {
 		.nops	= 0,
 	};
-	const u32 lease_bitmap[2] = { FATTR4_WORD0_LEASE_TIME, 0 };
+	const u32 lease_bitmap[3] = { FATTR4_WORD0_LEASE_TIME };
 
 	encode_compound_hdr(xdr, req, &hdr);
 	encode_setclientid_confirm(xdr, arg, &hdr);
@@ -2787,7 +2823,7 @@ static void nfs4_xdr_enc_get_lease_time(struct rpc_rqst *req,
 	struct compound_hdr hdr = {
 		.minorversion = nfs4_xdr_minorversion(&args->la_seq_args),
 	};
-	const u32 lease_bitmap[2] = { FATTR4_WORD0_LEASE_TIME, 0 };
+	const u32 lease_bitmap[3] = { FATTR4_WORD0_LEASE_TIME };
 
 	encode_compound_hdr(xdr, req, &hdr);
 	encode_sequence(xdr, &args->la_seq_args, &hdr);
@@ -3068,14 +3104,17 @@ static int decode_attr_bitmap(struct xdr_stream *xdr, uint32_t *bitmap)
 		goto out_overflow;
 	bmlen = be32_to_cpup(p);
 
-	bitmap[0] = bitmap[1] = 0;
+	bitmap[0] = bitmap[1] = bitmap[2] = 0;
 	p = xdr_inline_decode(xdr, (bmlen << 2));
 	if (unlikely(!p))
 		goto out_overflow;
 	if (bmlen > 0) {
 		bitmap[0] = be32_to_cpup(p++);
-		if (bmlen > 1)
-			bitmap[1] = be32_to_cpup(p);
+		if (bmlen > 1) {
+			bitmap[1] = be32_to_cpup(p++);
+			if (bmlen > 2)
+				bitmap[2] = be32_to_cpup(p);
+		}
 	}
 	return 0;
 out_overflow:
@@ -3107,8 +3146,9 @@ static int decode_attr_supported(struct xdr_stream *xdr, uint32_t *bitmap, uint3
 			return ret;
 		bitmap[0] &= ~FATTR4_WORD0_SUPPORTED_ATTRS;
 	} else
-		bitmask[0] = bitmask[1] = 0;
-	dprintk("%s: bitmask=%08x:%08x\n", __func__, bitmask[0], bitmask[1]);
+		bitmask[0] = bitmask[1] = bitmask[2] = 0;
+	dprintk("%s: bitmask=%08x:%08x:%08x\n", __func__,
+		bitmask[0], bitmask[1], bitmask[2]);
 	return 0;
 }
 
@@ -4162,7 +4202,7 @@ out_overflow:
 static int decode_server_caps(struct xdr_stream *xdr, struct nfs4_server_caps_res *res)
 {
 	__be32 *savep;
-	uint32_t attrlen, bitmap[2] = {0};
+	uint32_t attrlen, bitmap[3] = {0};
 	int status;
 
 	if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
@@ -4188,7 +4228,7 @@ xdr_error:
 static int decode_statfs(struct xdr_stream *xdr, struct nfs_fsstat *fsstat)
 {
 	__be32 *savep;
-	uint32_t attrlen, bitmap[2] = {0};
+	uint32_t attrlen, bitmap[3] = {0};
 	int status;
 
 	if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
@@ -4220,7 +4260,7 @@ xdr_error:
 static int decode_pathconf(struct xdr_stream *xdr, struct nfs_pathconf *pathconf)
 {
 	__be32 *savep;
-	uint32_t attrlen, bitmap[2] = {0};
+	uint32_t attrlen, bitmap[3] = {0};
 	int status;
 
 	if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
@@ -4360,7 +4400,7 @@ static int decode_getfattr_generic(struct xdr_stream *xdr, struct nfs_fattr *fat
 {
 	__be32 *savep;
 	uint32_t attrlen,
-		 bitmap[2] = {0};
+		 bitmap[3] = {0};
 	int status;
 
 	status = decode_op_hdr(xdr, OP_GETATTR);
@@ -4446,10 +4486,32 @@ static int decode_attr_pnfstype(struct xdr_stream *xdr, uint32_t *bitmap,
 	return status;
 }
 
+/*
+ * The preferred block size for layout-directed I/O
+ */
+static int decode_attr_layout_blksize(struct xdr_stream *xdr, uint32_t *bitmap,
+				      uint32_t *res)
+{
+	__be32 *p;
+
+	dprintk("%s: bitmap is %x\n", __func__, bitmap[2]);
+	*res = 0;
+	if (bitmap[2] & FATTR4_WORD2_LAYOUT_BLKSIZE) {
+		p = xdr_inline_decode(xdr, 4);
+		if (unlikely(!p)) {
+			print_overflow_msg(__func__, xdr);
+			return -EIO;
+		}
+		*res = be32_to_cpup(p);
+		bitmap[2] &= ~FATTR4_WORD2_LAYOUT_BLKSIZE;
+	}
+	return 0;
+}
+
 static int decode_fsinfo(struct xdr_stream *xdr, struct nfs_fsinfo *fsinfo)
 {
 	__be32 *savep;
-	uint32_t attrlen, bitmap[2];
+	uint32_t attrlen, bitmap[3];
 	int status;
 
 	if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
@@ -4477,6 +4539,9 @@ static int decode_fsinfo(struct xdr_stream *xdr, struct nfs_fsinfo *fsinfo)
 	status = decode_attr_pnfstype(xdr, bitmap, &fsinfo->layouttype);
 	if (status != 0)
 		goto xdr_error;
+	status = decode_attr_layout_blksize(xdr, bitmap, &fsinfo->blksize);
+	if (status)
+		goto xdr_error;
 
 	status = verify_attr_len(xdr, savep, attrlen);
 xdr_error:
@@ -4896,7 +4961,7 @@ static int decode_getacl(struct xdr_stream *xdr, struct rpc_rqst *req,
 {
 	__be32 *savep;
 	uint32_t attrlen,
-		 bitmap[2] = {0};
+		 bitmap[3] = {0};
 	struct kvec *iov = req->rq_rcv_buf.head;
 	int status;
 
@@ -6852,7 +6917,7 @@ out:
 int nfs4_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry,
 		       int plus)
 {
-	uint32_t bitmap[2] = {0};
+	uint32_t bitmap[3] = {0};
 	uint32_t len;
 	__be32 *p = xdr_inline_decode(xdr, 4);
 	if (unlikely(!p))
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 4faeac8..b2ea8b8 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -132,7 +132,7 @@ struct nfs_server {
 #endif
 
 #ifdef CONFIG_NFS_V4
-	u32			attr_bitmask[2];/* V4 bitmask representing the set
+	u32			attr_bitmask[3];/* V4 bitmask representing the set
 						   of attributes supported on this
 						   filesystem */
 	u32			cache_consistency_bitmask[2];
@@ -145,6 +145,7 @@ struct nfs_server {
 						   filesystem */
 	struct pnfs_layoutdriver_type  *pnfs_curr_ld; /* Active layout driver */
 	struct rpc_wait_queue	roc_rpcwaitq;
+	u32			pnfs_blksize;	/* layout_blksize attr */
 
 	/* the following fields are protected by nfs_client->cl_lock */
 	struct rb_root		state_owners;
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 21f333e..94f27e5 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -122,6 +122,7 @@ struct nfs_fsinfo {
 	struct timespec		time_delta; /* server time granularity */
 	__u32			lease_time; /* in seconds */
 	__u32			layouttype; /* supported pnfs layout driver */
+	__u32			blksize; /* preferred pnfs io block size */
 };
 
 struct nfs_fsstat {
@@ -954,7 +955,7 @@ struct nfs4_server_caps_arg {
 };
 
 struct nfs4_server_caps_res {
-	u32				attr_bitmask[2];
+	u32				attr_bitmask[3];
 	u32				acl_bitmask;
 	u32				has_links;
 	u32				has_symlinks;
-- 
1.7.4.1



* [PATCH v4 09/27] pnfs: cleanup_layoutcommit
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (7 preceding siblings ...)
  2011-07-28 17:30 ` [PATCH v4 08/27] pnfs: ask for layout_blksize and save it in nfs_server Jim Rees
@ 2011-07-28 17:30 ` Jim Rees
  2011-07-28 18:26   ` Boaz Harrosh
  2011-07-28 17:30 ` [PATCH v4 10/27] pnfsblock: add blocklayout Kconfig option, Makefile, and stubs Jim Rees
                   ` (18 subsequent siblings)
  27 siblings, 1 reply; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:30 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Andy Adamson <andros@netapp.com>

This gives the layout driver a chance to clean up structures it set up in
encode_layoutcommit.

Signed-off-by: Andy Adamson <andros@netapp.com>
[fixup layout header pointer for layoutcommit]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
---
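Reviewer note, not part of the patch: a minimal userspace sketch of the
optional-callback pattern used by pnfs_cleanup_layoutcommit below, where a
driver that has nothing to release simply leaves the hook NULL.  The types
and names (commit_data, layout_ops, run_cleanup_commit, demo_cleanup) are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-commit state a driver may have attached at encode time. */
struct commit_data {
	int driver_private_released;
};

/* Hypothetical driver operations table; cleanup_commit may be NULL. */
struct layout_ops {
	void (*cleanup_commit)(struct commit_data *data);
};

/* Generic layer: call the driver's hook only if the driver provides one. */
static void run_cleanup_commit(const struct layout_ops *ops,
			       struct commit_data *data)
{
	if (ops->cleanup_commit)
		ops->cleanup_commit(data);
}

/* Example driver hook that records that it released its private state. */
static void demo_cleanup(struct commit_data *data)
{
	data->driver_private_released = 1;
}
```

The generic code stays unaware of what the driver allocated; it only
guarantees that the hook runs once per commit if it is set.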
 fs/nfs/nfs4proc.c       |    1 +
 fs/nfs/nfs4xdr.c        |    1 +
 fs/nfs/pnfs.c           |   10 ++++++++++
 fs/nfs/pnfs.h           |    5 +++++
 include/linux/nfs_xdr.h |    1 +
 5 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index e86de79..6cb84b4 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -5963,6 +5963,7 @@ static void nfs4_layoutcommit_release(void *calldata)
 	struct nfs4_layoutcommit_data *data = calldata;
 	struct pnfs_layout_segment *lseg, *tmp;
 
+	pnfs_cleanup_layoutcommit(data->args.inode, data);
 	/* Matched by references in pnfs_set_layoutcommit */
 	list_for_each_entry_safe(lseg, tmp, &data->lseg_list, pls_lc_list) {
 		list_del_init(&lseg->pls_lc_list);
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index 0261669..1dce12f 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -5599,6 +5599,7 @@ static int decode_layoutcommit(struct xdr_stream *xdr,
 	int status;
 
 	status = decode_op_hdr(xdr, OP_LAYOUTCOMMIT);
+	res->status = status;
 	if (status)
 		return status;
 
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 3a47f7c..c1cc216 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1411,6 +1411,16 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
 }
 EXPORT_SYMBOL_GPL(pnfs_set_layoutcommit);
 
+void pnfs_cleanup_layoutcommit(struct inode *inode,
+			       struct nfs4_layoutcommit_data *data)
+{
+	struct nfs_server *nfss = NFS_SERVER(inode);
+
+	if (nfss->pnfs_curr_ld->cleanup_layoutcommit)
+		nfss->pnfs_curr_ld->cleanup_layoutcommit(NFS_I(inode)->layout,
+							 data);
+}
+
 /*
  * For the LAYOUT4_NFSV4_1_FILES layout type, NFS_DATA_SYNC WRITEs and
  * NFS_UNSTABLE WRITEs with a COMMIT to data servers must store enough
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index bddd8b9..f271425 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -113,6 +113,9 @@ struct pnfs_layoutdriver_type {
 				     struct xdr_stream *xdr,
 				     const struct nfs4_layoutreturn_args *args);
 
+	void (*cleanup_layoutcommit) (struct pnfs_layout_hdr *layoutid,
+				      struct nfs4_layoutcommit_data *data);
+
 	void (*encode_layoutcommit) (struct pnfs_layout_hdr *layoutid,
 				     struct xdr_stream *xdr,
 				     const struct nfs4_layoutcommit_args *args);
@@ -196,6 +199,8 @@ void pnfs_roc_release(struct inode *ino);
 void pnfs_roc_set_barrier(struct inode *ino, u32 barrier);
 bool pnfs_roc_drain(struct inode *ino, u32 *barrier);
 void pnfs_set_layoutcommit(struct nfs_write_data *wdata);
+void pnfs_cleanup_layoutcommit(struct inode *inode,
+			       struct nfs4_layoutcommit_data *data);
 int pnfs_layoutcommit_inode(struct inode *inode, bool sync);
 int _pnfs_return_layout(struct inode *);
 int pnfs_ld_write_done(struct nfs_write_data *);
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 94f27e5..569ea5b 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -269,6 +269,7 @@ struct nfs4_layoutcommit_res {
 	struct nfs_fattr *fattr;
 	const struct nfs_server *server;
 	struct nfs4_sequence_res seq_res;
+	int status;
 };
 
 struct nfs4_layoutcommit_data {
-- 
1.7.4.1



* [PATCH v4 10/27] pnfsblock: add blocklayout Kconfig option, Makefile, and stubs
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (8 preceding siblings ...)
  2011-07-28 17:30 ` [PATCH v4 09/27] pnfs: cleanup_layoutcommit Jim Rees
@ 2011-07-28 17:30 ` Jim Rees
  2011-07-28 17:31 ` [PATCH v4 11/27] pnfsblock: use pageio_ops api Jim Rees
                   ` (17 subsequent siblings)
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:30 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Define a configuration variable to enable/disable compilation of the
block driver code.

Add the minimal structure for a pnfs block layout driver, and empty
list heads that will hold the extent data.

[pnfsblock: make NFS_V4_1 select PNFS_BLOCK]
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[pnfs-block: fix CONFIG_PNFS_BLOCK dependencies]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[pnfsblock: SQUASHME: adjust to API change]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfs: move pnfs_layout_type inline in nfs_inode]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[blocklayout: encode_layoutcommit implementation]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[pnfsblock: layout alloc and free]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfs: move pnfs_layout_type inline in nfs_inode]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[pnfsblock: define module alias]
Signed-off-by: Peng Tao <peng_tao@emc.com>
---
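Reviewer note, not part of the patch: a userspace sketch of the two ideas
in bl_alloc_layout_hdr and BLK_LO2EXT below.  The driver embeds the generic
header inside its own structure, hands out a pointer to the embedded
member, and later recovers the outer structure with container_of().  It
also stores the server block size in 512-byte sectors rather than bytes.
The struct and function names are hypothetical, and the container_of macro
here is a simplified version without the kernel's type checking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors */

/* Simplified container_of(): step back from a member to its container. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical generic header owned by the core layer. */
struct layout_hdr {
	int refcount;
};

/* Hypothetical driver-private layout wrapping the generic header. */
struct block_layout {
	struct layout_hdr hdr;
	uint64_t blocksize;	/* server block size, in sectors */
};

/* Allocate the driver structure but return only the embedded header. */
static struct layout_hdr *block_layout_alloc(uint32_t blksize_bytes)
{
	struct block_layout *bl = calloc(1, sizeof(*bl));

	if (!bl)
		return NULL;
	bl->blocksize = blksize_bytes >> SECTOR_SHIFT;
	return &bl->hdr;
}

/* Recover the driver structure from the generic header pointer. */
static struct block_layout *to_block_layout(struct layout_hdr *lo)
{
	return container_of(lo, struct block_layout, hdr);
}

/* Free through the outer structure, since that is what was allocated. */
static void block_layout_free(struct layout_hdr *lo)
{
	free(to_block_layout(lo));
}
```

With a 4096-byte server block size the sketch stores 8, that is, 4096 >> 9
sectors per block.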
 fs/nfs/Kconfig                   |    8 ++-
 fs/nfs/Makefile                  |    1 +
 fs/nfs/blocklayout/Makefile      |    5 +
 fs/nfs/blocklayout/blocklayout.c |  175 ++++++++++++++++++++++++++++++++++++++
 fs/nfs/blocklayout/blocklayout.h |   91 ++++++++++++++++++++
 5 files changed, 279 insertions(+), 1 deletions(-)
 create mode 100644 fs/nfs/blocklayout/Makefile
 create mode 100644 fs/nfs/blocklayout/blocklayout.c
 create mode 100644 fs/nfs/blocklayout/blocklayout.h

diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig
index 2cde5d9..be02077 100644
--- a/fs/nfs/Kconfig
+++ b/fs/nfs/Kconfig
@@ -79,15 +79,21 @@ config NFS_V4_1
 	depends on NFS_FS && NFS_V4 && EXPERIMENTAL
 	select SUNRPC_BACKCHANNEL
 	select PNFS_FILE_LAYOUT
+	select PNFS_BLOCK
+	select MD
+	select BLK_DEV_DM
 	help
 	  This option enables support for minor version 1 of the NFSv4 protocol
-	  (RFC 5661) in the kernel's NFS client.
+	  (RFC 5661 and RFC 5663) in the kernel's NFS client.
 
 	  If unsure, say N.
 
 config PNFS_FILE_LAYOUT
 	tristate
 
+config PNFS_BLOCK
+	tristate
+
 config PNFS_OBJLAYOUT
 	tristate "Provide support for the pNFS Objects Layout Driver for NFSv4.1 pNFS (EXPERIMENTAL)"
 	depends on NFS_FS && NFS_V4_1 && SCSI_OSD_ULD
diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index 6a34f7d..b58613d 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -23,3 +23,4 @@ obj-$(CONFIG_PNFS_FILE_LAYOUT) += nfs_layout_nfsv41_files.o
 nfs_layout_nfsv41_files-y := nfs4filelayout.o nfs4filelayoutdev.o
 
 obj-$(CONFIG_PNFS_OBJLAYOUT) += objlayout/
+obj-$(CONFIG_PNFS_BLOCK) += blocklayout/
diff --git a/fs/nfs/blocklayout/Makefile b/fs/nfs/blocklayout/Makefile
new file mode 100644
index 0000000..6bf49cd
--- /dev/null
+++ b/fs/nfs/blocklayout/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for the pNFS block layout driver kernel module
+#
+obj-$(CONFIG_PNFS_BLOCK) += blocklayoutdriver.o
+blocklayoutdriver-objs := blocklayout.o
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
new file mode 100644
index 0000000..55a2a95
--- /dev/null
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -0,0 +1,175 @@
+/*
+ *  linux/fs/nfs/blocklayout/blocklayout.c
+ *
+ *  Module for the NFSv4.1 pNFS block layout driver.
+ *
+ *  Copyright (c) 2006 The Regents of the University of Michigan.
+ *  All rights reserved.
+ *
+ *  Andy Adamson <andros@citi.umich.edu>
+ *  Fred Isaman <iisaman@umich.edu>
+ *
+ * permission is granted to use, copy, create derivative works and
+ * redistribute this software and such derivative works for any purpose,
+ * so long as the name of the university of michigan is not used in
+ * any advertising or publicity pertaining to the use or distribution
+ * of this software without specific, written prior authorization.  if
+ * the above copyright notice or any other identification of the
+ * university of michigan is included in any copy of any portion of
+ * this software, then the disclaimer below must also be included.
+ *
+ * this software is provided as is, without representation from the
+ * university of michigan as to its fitness for any purpose, and without
+ * warranty by the university of michigan of any kind, either express
+ * or implied, including without limitation the implied warranties of
+ * merchantability and fitness for a particular purpose.  the regents
+ * of the university of michigan shall not be liable for any damages,
+ * including special, indirect, incidental, or consequential damages,
+ * with respect to any claim arising out or in connection with the use
+ * of the software, even if it has been or is hereafter advised of the
+ * possibility of such damages.
+ */
+#include <linux/module.h>
+#include <linux/init.h>
+
+#include "blocklayout.h"
+
+#define NFSDBG_FACILITY	NFSDBG_PNFS_LD
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Andy Adamson <andros@citi.umich.edu>");
+MODULE_DESCRIPTION("The NFSv4.1 pNFS Block layout driver");
+
+static enum pnfs_try_status
+bl_read_pagelist(struct nfs_read_data *rdata)
+{
+	return PNFS_NOT_ATTEMPTED;
+}
+
+static enum pnfs_try_status
+bl_write_pagelist(struct nfs_write_data *wdata,
+		  int sync)
+{
+	return PNFS_NOT_ATTEMPTED;
+}
+
+/* STUB */
+static void
+release_extents(struct pnfs_block_layout *bl,
+		struct pnfs_layout_range *range)
+{
+	return;
+}
+
+/* STUB */
+static void
+release_inval_marks(struct pnfs_inval_markings *marks)
+{
+	return;
+}
+
+static void bl_free_layout_hdr(struct pnfs_layout_hdr *lo)
+{
+	struct pnfs_block_layout *bl = BLK_LO2EXT(lo);
+
+	dprintk("%s enter\n", __func__);
+	release_extents(bl, NULL);
+	release_inval_marks(&bl->bl_inval);
+	kfree(bl);
+}
+
+static struct pnfs_layout_hdr *bl_alloc_layout_hdr(struct inode *inode,
+						   gfp_t gfp_flags)
+{
+	struct pnfs_block_layout *bl;
+
+	dprintk("%s enter\n", __func__);
+	bl = kzalloc(sizeof(*bl), gfp_flags);
+	if (!bl)
+		return NULL;
+	spin_lock_init(&bl->bl_ext_lock);
+	INIT_LIST_HEAD(&bl->bl_extents[0]);
+	INIT_LIST_HEAD(&bl->bl_extents[1]);
+	INIT_LIST_HEAD(&bl->bl_commit);
+	INIT_LIST_HEAD(&bl->bl_committing);
+	bl->bl_count = 0;
+	bl->bl_blocksize = NFS_SERVER(inode)->pnfs_blksize >> SECTOR_SHIFT;
+	INIT_INVAL_MARKS(&bl->bl_inval, bl->bl_blocksize);
+	return &bl->bl_layout;
+}
+
+static void
+bl_free_lseg(struct pnfs_layout_segment *lseg)
+{
+}
+
+static struct pnfs_layout_segment *
+bl_alloc_lseg(struct pnfs_layout_hdr *lo,
+	      struct nfs4_layoutget_res *lgr, gfp_t gfp_flags)
+{
+	return NULL;
+}
+
+static void
+bl_encode_layoutcommit(struct pnfs_layout_hdr *lo, struct xdr_stream *xdr,
+		       const struct nfs4_layoutcommit_args *arg)
+{
+}
+
+static void
+bl_cleanup_layoutcommit(struct pnfs_layout_hdr *lo,
+			struct nfs4_layoutcommit_data *lcdata)
+{
+}
+
+static int
+bl_set_layoutdriver(struct nfs_server *server, const struct nfs_fh *fh)
+{
+	dprintk("%s enter\n", __func__);
+	return 0;
+}
+
+static int
+bl_clear_layoutdriver(struct nfs_server *server)
+{
+	dprintk("%s enter\n", __func__);
+	return 0;
+}
+
+static struct pnfs_layoutdriver_type blocklayout_type = {
+	.id				= LAYOUT_BLOCK_VOLUME,
+	.name				= "LAYOUT_BLOCK_VOLUME",
+	.read_pagelist			= bl_read_pagelist,
+	.write_pagelist			= bl_write_pagelist,
+	.alloc_layout_hdr		= bl_alloc_layout_hdr,
+	.free_layout_hdr		= bl_free_layout_hdr,
+	.alloc_lseg			= bl_alloc_lseg,
+	.free_lseg			= bl_free_lseg,
+	.encode_layoutcommit		= bl_encode_layoutcommit,
+	.cleanup_layoutcommit		= bl_cleanup_layoutcommit,
+	.set_layoutdriver		= bl_set_layoutdriver,
+	.clear_layoutdriver		= bl_clear_layoutdriver,
+};
+
+static int __init nfs4blocklayout_init(void)
+{
+	int ret;
+
+	dprintk("%s: NFSv4 Block Layout Driver Registering...\n", __func__);
+
+	ret = pnfs_register_layoutdriver(&blocklayout_type);
+	return ret;
+}
+
+static void __exit nfs4blocklayout_exit(void)
+{
+	dprintk("%s: NFSv4 Block Layout Driver Unregistering...\n",
+	       __func__);
+
+	pnfs_unregister_layoutdriver(&blocklayout_type);
+}
+
+MODULE_ALIAS("nfs-layouttype4-3");
+
+module_init(nfs4blocklayout_init);
+module_exit(nfs4blocklayout_exit);
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
new file mode 100644
index 0000000..0e6ae06
--- /dev/null
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -0,0 +1,91 @@
+/*
+ *  linux/fs/nfs/blocklayout/blocklayout.h
+ *
+ *  Module for the NFSv4.1 pNFS block layout driver.
+ *
+ *  Copyright (c) 2006 The Regents of the University of Michigan.
+ *  All rights reserved.
+ *
+ *  Andy Adamson <andros@citi.umich.edu>
+ *  Fred Isaman <iisaman@umich.edu>
+ *
+ * permission is granted to use, copy, create derivative works and
+ * redistribute this software and such derivative works for any purpose,
+ * so long as the name of the university of michigan is not used in
+ * any advertising or publicity pertaining to the use or distribution
+ * of this software without specific, written prior authorization.  if
+ * the above copyright notice or any other identification of the
+ * university of michigan is included in any copy of any portion of
+ * this software, then the disclaimer below must also be included.
+ *
+ * this software is provided as is, without representation from the
+ * university of michigan as to its fitness for any purpose, and without
+ * warranty by the university of michigan of any kind, either express
+ * or implied, including without limitation the implied warranties of
+ * merchantability and fitness for a particular purpose.  the regents
+ * of the university of michigan shall not be liable for any damages,
+ * including special, indirect, incidental, or consequential damages,
+ * with respect to any claim arising out or in connection with the use
+ * of the software, even if it has been or is hereafter advised of the
+ * possibility of such damages.
+ */
+#ifndef FS_NFS_NFS4BLOCKLAYOUT_H
+#define FS_NFS_NFS4BLOCKLAYOUT_H
+
+#include <linux/device-mapper.h>
+#include <linux/nfs_fs.h>
+#include "../pnfs.h"
+
+enum exstate4 {
+	PNFS_BLOCK_READWRITE_DATA	= 0,
+	PNFS_BLOCK_READ_DATA		= 1,
+	PNFS_BLOCK_INVALID_DATA		= 2, /* mapped, but data is invalid */
+	PNFS_BLOCK_NONE_DATA		= 3  /* unmapped, it's a hole */
+};
+
+struct pnfs_inval_markings {
+	/* STUB */
+};
+
+/* sector_t fields are all in 512-byte sectors */
+struct pnfs_block_extent {
+	struct kref	be_refcnt;
+	struct list_head be_node;	/* link into lseg list */
+	struct nfs4_deviceid be_devid;  /* FIXME: could use device cache instead */
+	struct block_device *be_mdev;
+	sector_t	be_f_offset;	/* the starting offset in the file */
+	sector_t	be_length;	/* the size of the extent */
+	sector_t	be_v_offset;	/* the starting offset in the volume */
+	enum exstate4	be_state;	/* the state of this extent */
+	struct pnfs_inval_markings *be_inval; /* tracks INVAL->RW transition */
+};
+
+static inline void
+INIT_INVAL_MARKS(struct pnfs_inval_markings *marks, sector_t blocksize)
+{
+	/* STUB */
+}
+
+enum extentclass4 {
+	RW_EXTENT       = 0, /* READWRITE and INVAL */
+	RO_EXTENT       = 1, /* READ and NONE */
+	EXTENT_LISTS    = 2,
+};
+
+struct pnfs_block_layout {
+	struct pnfs_layout_hdr bl_layout;
+	struct pnfs_inval_markings bl_inval; /* tracks INVAL->RW transition */
+	spinlock_t		bl_ext_lock;   /* Protects list manipulation */
+	struct list_head	bl_extents[EXTENT_LISTS]; /* R and RW extents */
+	struct list_head	bl_commit;	/* Needs layout commit */
+	struct list_head	bl_committing;	/* Layout committing */
+	unsigned int		bl_count;	/* entries in bl_commit */
+	sector_t		bl_blocksize;  /* Server blocksize in sectors */
+};
+
+static inline struct pnfs_block_layout *BLK_LO2EXT(struct pnfs_layout_hdr *lo)
+{
+	return container_of(lo, struct pnfs_block_layout, bl_layout);
+}
+
+#endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
-- 
1.7.4.1
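
Editor's note on BLK_LO2EXT() above: the helper relies on the generic layout header being embedded in the block layout structure, so container_of() can recover the outer object from a pointer to the inner one. A minimal userspace sketch follows. The trimmed structures, the simplified container_of macro (no type checking), and the names lo2ext() and container_demo() are assumptions made for illustration, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, trimmed-down versions of the structures in blocklayout.h. */
struct layout_hdr {
	int refcount;
};

struct block_layout {
	unsigned long bl_blocksize;	/* server block size in sectors */
	struct layout_hdr bl_layout;	/* embedded generic header, not first on purpose */
};

/* Mirrors BLK_LO2EXT(): map the generic header back to its container. */
struct block_layout *lo2ext(struct layout_hdr *lo)
{
	return container_of(lo, struct block_layout, bl_layout);
}

/* Returns 0 if the round trip header -> container works. */
int container_demo(void)
{
	struct block_layout bl = { .bl_blocksize = 8, .bl_layout = { .refcount = 1 } };
	struct layout_hdr *lo = &bl.bl_layout;

	assert(lo2ext(lo) == &bl);
	assert(lo2ext(lo)->bl_blocksize == 8);
	return 0;
}
```

The embedded member is deliberately placed second in the sketch to show that the conversion does not depend on the header sitting at offset zero.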



* [PATCH v4 11/27] pnfsblock: use pageio_ops api
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (9 preceding siblings ...)
  2011-07-28 17:30 ` [PATCH v4 10/27] pnfsblock: add blocklayout Kconfig option, Makefile, and stubs Jim Rees
@ 2011-07-28 17:31 ` Jim Rees
  2011-07-28 17:31 ` [PATCH v4 12/27] pnfsblock: basic extent code Jim Rees
                   ` (16 subsequent siblings)
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:31 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Benny Halevy <bhalevy@panasas.com>

[pnfsblock: use pnfs_generic_pg_init_read/write]
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
 fs/nfs/blocklayout/blocklayout.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 55a2a95..e341c62 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -136,6 +136,18 @@ bl_clear_layoutdriver(struct nfs_server *server)
 	return 0;
 }
 
+static const struct nfs_pageio_ops bl_pg_read_ops = {
+	.pg_init = pnfs_generic_pg_init_read,
+	.pg_test = pnfs_generic_pg_test,
+	.pg_doio = pnfs_generic_pg_readpages,
+};
+
+static const struct nfs_pageio_ops bl_pg_write_ops = {
+	.pg_init = pnfs_generic_pg_init_write,
+	.pg_test = pnfs_generic_pg_test,
+	.pg_doio = pnfs_generic_pg_writepages,
+};
+
 static struct pnfs_layoutdriver_type blocklayout_type = {
 	.id				= LAYOUT_BLOCK_VOLUME,
 	.name				= "LAYOUT_BLOCK_VOLUME",
@@ -149,6 +161,8 @@ static struct pnfs_layoutdriver_type blocklayout_type = {
 	.cleanup_layoutcommit		= bl_cleanup_layoutcommit,
 	.set_layoutdriver		= bl_set_layoutdriver,
 	.clear_layoutdriver		= bl_clear_layoutdriver,
+	.pg_read_ops			= &bl_pg_read_ops,
+	.pg_write_ops			= &bl_pg_write_ops,
 };
 
 static int __init nfs4blocklayout_init(void)
-- 
1.7.4.1



* [PATCH v4 12/27] pnfsblock: basic extent code
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (10 preceding siblings ...)
  2011-07-28 17:31 ` [PATCH v4 11/27] pnfsblock: use pageio_ops api Jim Rees
@ 2011-07-28 17:31 ` Jim Rees
  2011-07-28 17:31 ` [PATCH v4 13/27] pnfsblock: add device operations Jim Rees
                   ` (15 subsequent siblings)
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:31 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Adds structures and basic create/delete code for extents.

Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Zhang Jingwang <Jingwang.Zhang@emc.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
---
 fs/nfs/blocklayout/Makefile      |    2 +-
 fs/nfs/blocklayout/blocklayout.c |   20 ++++++--
 fs/nfs/blocklayout/blocklayout.h |    1 +
 fs/nfs/blocklayout/extents.c     |   97 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 115 insertions(+), 5 deletions(-)
 create mode 100644 fs/nfs/blocklayout/extents.c

diff --git a/fs/nfs/blocklayout/Makefile b/fs/nfs/blocklayout/Makefile
index 6bf49cd..5cfadf6 100644
--- a/fs/nfs/blocklayout/Makefile
+++ b/fs/nfs/blocklayout/Makefile
@@ -2,4 +2,4 @@
 # Makefile for the pNFS block layout driver kernel module
 #
 obj-$(CONFIG_PNFS_BLOCK) += blocklayoutdriver.o
-blocklayoutdriver-objs := blocklayout.o
+blocklayoutdriver-objs := blocklayout.o extents.o
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index e341c62..252917d 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -53,12 +53,24 @@ bl_write_pagelist(struct nfs_write_data *wdata,
 	return PNFS_NOT_ATTEMPTED;
 }
 
-/* STUB */
+/* FIXME - range ignored */
 static void
-release_extents(struct pnfs_block_layout *bl,
-		struct pnfs_layout_range *range)
+release_extents(struct pnfs_block_layout *bl, struct pnfs_layout_range *range)
 {
-	return;
+	int i;
+	struct pnfs_block_extent *be;
+
+	spin_lock(&bl->bl_ext_lock);
+	for (i = 0; i < EXTENT_LISTS; i++) {
+		while (!list_empty(&bl->bl_extents[i])) {
+			be = list_first_entry(&bl->bl_extents[i],
+					      struct pnfs_block_extent,
+					      be_node);
+			list_del(&be->be_node);
+			bl_put_extent(be);
+		}
+	}
+	spin_unlock(&bl->bl_ext_lock);
 }
 
 /* STUB */
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 0e6ae06..3fec302 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -88,4 +88,5 @@ static inline struct pnfs_block_layout *BLK_LO2EXT(struct pnfs_layout_hdr *lo)
 	return container_of(lo, struct pnfs_block_layout, bl_layout);
 }
 
+void bl_put_extent(struct pnfs_block_extent *be);
 #endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
new file mode 100644
index 0000000..44c3364
--- /dev/null
+++ b/fs/nfs/blocklayout/extents.c
@@ -0,0 +1,97 @@
+/*
+ *  linux/fs/nfs/blocklayout/extents.c
+ *
+ *  Module for the NFSv4.1 pNFS block layout driver.
+ *
+ *  Copyright (c) 2006 The Regents of the University of Michigan.
+ *  All rights reserved.
+ *
+ *  Andy Adamson <andros@citi.umich.edu>
+ *  Fred Isaman <iisaman@umich.edu>
+ *
+ * permission is granted to use, copy, create derivative works and
+ * redistribute this software and such derivative works for any purpose,
+ * so long as the name of the university of michigan is not used in
+ * any advertising or publicity pertaining to the use or distribution
+ * of this software without specific, written prior authorization.  if
+ * the above copyright notice or any other identification of the
+ * university of michigan is included in any copy of any portion of
+ * this software, then the disclaimer below must also be included.
+ *
+ * this software is provided as is, without representation from the
+ * university of michigan as to its fitness for any purpose, and without
+ * warranty by the university of michigan of any kind, either express
+ * or implied, including without limitation the implied warranties of
+ * merchantability and fitness for a particular purpose.  the regents
+ * of the university of michigan shall not be liable for any damages,
+ * including special, indirect, incidental, or consequential damages,
+ * with respect to any claim arising out or in connection with the use
+ * of the software, even if it has been or is hereafter advised of the
+ * possibility of such damages.
+ */
+
+#include "blocklayout.h"
+#define NFSDBG_FACILITY         NFSDBG_PNFS_LD
+
+static void print_bl_extent(struct pnfs_block_extent *be)
+{
+	dprintk("PRINT EXTENT extent %p\n", be);
+	if (be) {
+		dprintk("        be_f_offset %llu\n", (u64)be->be_f_offset);
+		dprintk("        be_length   %llu\n", (u64)be->be_length);
+		dprintk("        be_v_offset %llu\n", (u64)be->be_v_offset);
+		dprintk("        be_state    %d\n", be->be_state);
+	}
+}
+
+static void
+destroy_extent(struct kref *kref)
+{
+	struct pnfs_block_extent *be;
+
+	be = container_of(kref, struct pnfs_block_extent, be_refcnt);
+	dprintk("%s be=%p\n", __func__, be);
+	kfree(be);
+}
+
+void
+bl_put_extent(struct pnfs_block_extent *be)
+{
+	if (be) {
+		dprintk("%s enter %p (%i)\n", __func__, be,
+			atomic_read(&be->be_refcnt.refcount));
+		kref_put(&be->be_refcnt, destroy_extent);
+	}
+}
+
+struct pnfs_block_extent *alloc_extent(void)
+{
+	struct pnfs_block_extent *be;
+
+	be = kmalloc(sizeof(struct pnfs_block_extent), GFP_NOFS);
+	if (!be)
+		return NULL;
+	INIT_LIST_HEAD(&be->be_node);
+	kref_init(&be->be_refcnt);
+	be->be_inval = NULL;
+	return be;
+}
+
+struct pnfs_block_extent *
+get_extent(struct pnfs_block_extent *be)
+{
+	if (be)
+		kref_get(&be->be_refcnt);
+	return be;
+}
+
+void print_elist(struct list_head *list)
+{
+	struct pnfs_block_extent *be;
+	dprintk("****************\n");
+	dprintk("Extent list looks like:\n");
+	list_for_each_entry(be, list, be_node) {
+		print_bl_extent(be);
+	}
+	dprintk("****************\n");
+}
-- 
1.7.4.1
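
Editor's note on the extent lifecycle in this patch: alloc_extent() hands the caller the initial reference, get_extent() takes another, and bl_put_extent() frees the extent when the last reference is dropped. The sketch below models that in userspace under stated assumptions: a plain int replaces struct kref (so it is not atomic, unlike the kernel code), and the names extent_alloc(), extent_get(), extent_put(), and live_extents are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace model of struct pnfs_block_extent. */
struct extent {
	int refcnt;			/* stands in for struct kref; NOT atomic */
	unsigned long long f_offset;	/* file offset, in 512-byte sectors */
	unsigned long long length;	/* extent length, in sectors */
};

static int live_extents;	/* number of allocated extents, demo bookkeeping only */

/* Like alloc_extent(): the caller owns the initial reference. */
struct extent *extent_alloc(void)
{
	struct extent *be = calloc(1, sizeof(*be));

	if (!be)
		return NULL;
	be->refcnt = 1;
	live_extents++;
	return be;
}

/* Like get_extent(): take an extra reference; NULL is passed through. */
struct extent *extent_get(struct extent *be)
{
	if (be)
		be->refcnt++;
	return be;
}

/* Like bl_put_extent(): drop a reference and free on the last one. */
void extent_put(struct extent *be)
{
	if (be && --be->refcnt == 0) {
		live_extents--;
		free(be);
	}
}

/* Returns 0 if the lifecycle behaves as described. */
int extent_lifecycle_demo(void)
{
	struct extent *be = extent_alloc();

	if (!be)
		return -1;
	extent_get(be);			/* a second holder appears */
	extent_put(be);			/* first holder is done */
	assert(live_extents == 1);	/* extent must still be alive */
	extent_put(be);			/* last reference dropped */
	assert(live_extents == 0);	/* extent has been freed */
	return 0;
}
```

This is the same pattern release_extents() depends on: unlinking an extent from a list and putting it only frees the memory if no other path still holds a reference.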



* [PATCH v4 13/27] pnfsblock: add device operations
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (11 preceding siblings ...)
  2011-07-28 17:31 ` [PATCH v4 12/27] pnfsblock: basic extent code Jim Rees
@ 2011-07-28 17:31 ` Jim Rees
  2011-07-28 17:31 ` [PATCH v4 14/27] pnfsblock: remove " Jim Rees
                   ` (14 subsequent siblings)
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:31 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[upcall bugfixes]
Signed-off-by: Peng Tao <peng_tao@emc.com>
---
 fs/nfs/blocklayout/Makefile         |    2 +-
 fs/nfs/blocklayout/blocklayout.c    |   42 ++++++++
 fs/nfs/blocklayout/blocklayout.h    |   40 +++++++
 fs/nfs/blocklayout/blocklayoutdev.c |  191 +++++++++++++++++++++++++++++++++++
 fs/nfs/client.c                     |    2 +-
 include/linux/nfs.h                 |    2 +
 6 files changed, 277 insertions(+), 2 deletions(-)
 create mode 100644 fs/nfs/blocklayout/blocklayoutdev.c

diff --git a/fs/nfs/blocklayout/Makefile b/fs/nfs/blocklayout/Makefile
index 5cfadf6..5bf3409 100644
--- a/fs/nfs/blocklayout/Makefile
+++ b/fs/nfs/blocklayout/Makefile
@@ -2,4 +2,4 @@
 # Makefile for the pNFS block layout driver kernel module
 #
 obj-$(CONFIG_PNFS_BLOCK) += blocklayoutdriver.o
-blocklayoutdriver-objs := blocklayout.o extents.o
+blocklayoutdriver-objs := blocklayout.o extents.o blocklayoutdev.o
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 252917d..c11c105 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -31,6 +31,8 @@
  */
 #include <linux/module.h>
 #include <linux/init.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
 
 #include "blocklayout.h"
 
@@ -40,6 +42,9 @@ MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Andy Adamson <andros@citi.umich.edu>");
 MODULE_DESCRIPTION("The NFSv4.1 pNFS Block layout driver");
 
+struct dentry *bl_device_pipe;
+wait_queue_head_t bl_wq;
+
 static enum pnfs_try_status
 bl_read_pagelist(struct nfs_read_data *rdata)
 {
@@ -177,13 +182,49 @@ static struct pnfs_layoutdriver_type blocklayout_type = {
 	.pg_write_ops			= &bl_pg_write_ops,
 };
 
+static const struct rpc_pipe_ops bl_upcall_ops = {
+	.upcall		= bl_pipe_upcall,
+	.downcall	= bl_pipe_downcall,
+	.destroy_msg	= bl_pipe_destroy_msg,
+};
+
 static int __init nfs4blocklayout_init(void)
 {
+	struct vfsmount *mnt;
+	struct path path;
 	int ret;
 
 	dprintk("%s: NFSv4 Block Layout Driver Registering...\n", __func__);
 
 	ret = pnfs_register_layoutdriver(&blocklayout_type);
+	if (ret)
+		goto out;
+
+	init_waitqueue_head(&bl_wq);
+
+	mnt = rpc_get_mount();
+	if (IS_ERR(mnt)) {
+		ret = PTR_ERR(mnt);
+		goto out_remove;
+	}
+
+	ret = vfs_path_lookup(mnt->mnt_root,
+			      mnt,
+			      NFS_PIPE_DIRNAME, 0, &path);
+	if (ret)
+		goto out_remove;
+
+	bl_device_pipe = rpc_mkpipe(path.dentry, "blocklayout", NULL,
+				    &bl_upcall_ops, 0);
+	if (IS_ERR(bl_device_pipe)) {
+		ret = PTR_ERR(bl_device_pipe);
+		goto out_remove;
+	}
+out:
+	return ret;
+
+out_remove:
+	pnfs_unregister_layoutdriver(&blocklayout_type);
 	return ret;
 }
 
@@ -193,6 +234,7 @@ static void __exit nfs4blocklayout_exit(void)
 	       __func__);
 
 	pnfs_unregister_layoutdriver(&blocklayout_type);
+	rpc_unlink(bl_device_pipe);
 }
 
 MODULE_ALIAS("nfs-layouttype4-3");
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 3fec302..3dcc971 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -34,8 +34,16 @@
 
 #include <linux/device-mapper.h>
 #include <linux/nfs_fs.h>
+#include <linux/sunrpc/rpc_pipe_fs.h>
+
 #include "../pnfs.h"
 
+struct pnfs_block_dev {
+	struct list_head		bm_node;
+	struct nfs4_deviceid		bm_mdevid;    /* associated devid */
+	struct block_device		*bm_mdev;     /* meta device itself */
+};
+
 enum exstate4 {
 	PNFS_BLOCK_READWRITE_DATA	= 0,
 	PNFS_BLOCK_READ_DATA		= 1,
@@ -88,5 +96,37 @@ static inline struct pnfs_block_layout *BLK_LO2EXT(struct pnfs_layout_hdr *lo)
 	return container_of(lo, struct pnfs_block_layout, bl_layout);
 }
 
+struct bl_dev_msg {
+	int status;
+	uint32_t major, minor;
+};
+
+struct bl_msg_hdr {
+	u8  type;
+	u16 totallen; /* length of entire message, including hdr itself */
+};
+
+extern struct dentry *bl_device_pipe;
+extern wait_queue_head_t bl_wq;
+
+#define BL_DEVICE_UMOUNT               0x0 /* Umount--delete devices */
+#define BL_DEVICE_MOUNT                0x1 /* Mount--create devices */
+#define BL_DEVICE_REQUEST_INIT         0x0 /* Start request */
+#define BL_DEVICE_REQUEST_PROC         0x1 /* User level process succeeds */
+#define BL_DEVICE_REQUEST_ERR          0x2 /* User level process fails */
+
+/* blocklayoutdev.c */
+ssize_t bl_pipe_upcall(struct file *, struct rpc_pipe_msg *,
+		       char __user *, size_t);
+ssize_t bl_pipe_downcall(struct file *, const char __user *, size_t);
+void bl_pipe_destroy_msg(struct rpc_pipe_msg *);
+struct block_device *nfs4_blkdev_get(dev_t dev);
+int nfs4_blkdev_put(struct block_device *bdev);
+struct pnfs_block_dev *nfs4_blk_decode_device(struct nfs_server *server,
+						struct pnfs_device *dev,
+						struct list_head *sdlist);
+int nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
+				struct nfs4_layoutget_res *lgr, gfp_t gfp_flags);
+
 void bl_put_extent(struct pnfs_block_extent *be);
 #endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
new file mode 100644
index 0000000..7e1377f
--- /dev/null
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -0,0 +1,191 @@
+/*
+ *  linux/fs/nfs/blocklayout/blocklayoutdev.c
+ *
+ *  Device operations for the pnfs nfs4 block layout driver.
+ *
+ *  Copyright (c) 2006 The Regents of the University of Michigan.
+ *  All rights reserved.
+ *
+ *  Andy Adamson <andros@citi.umich.edu>
+ *  Fred Isaman <iisaman@umich.edu>
+ *
+ * permission is granted to use, copy, create derivative works and
+ * redistribute this software and such derivative works for any purpose,
+ * so long as the name of the university of michigan is not used in
+ * any advertising or publicity pertaining to the use or distribution
+ * of this software without specific, written prior authorization.  if
+ * the above copyright notice or any other identification of the
+ * university of michigan is included in any copy of any portion of
+ * this software, then the disclaimer below must also be included.
+ *
+ * this software is provided as is, without representation from the
+ * university of michigan as to its fitness for any purpose, and without
+ * warranty by the university of michigan of any kind, either express
+ * or implied, including without limitation the implied warranties of
+ * merchantability and fitness for a particular purpose.  the regents
+ * of the university of michigan shall not be liable for any damages,
+ * including special, indirect, incidental, or consequential damages,
+ * with respect to any claim arising out or in connection with the use
+ * of the software, even if it has been or is hereafter advised of the
+ * possibility of such damages.
+ */
+#include <linux/module.h>
+#include <linux/buffer_head.h> /* __bread */
+
+#include <linux/genhd.h>
+#include <linux/blkdev.h>
+#include <linux/hash.h>
+
+#include "blocklayout.h"
+
+#define NFSDBG_FACILITY         NFSDBG_PNFS_LD
+
+/* Open a block_device by device number. */
+struct block_device *nfs4_blkdev_get(dev_t dev)
+{
+	struct block_device *bd;
+
+	dprintk("%s enter\n", __func__);
+	bd = blkdev_get_by_dev(dev, FMODE_READ, NULL);
+	if (IS_ERR(bd))
+		goto fail;
+	return bd;
+fail:
+	dprintk("%s failed to open device : %ld\n",
+			__func__, PTR_ERR(bd));
+	return bd;	/* ERR_PTR value, so the caller's IS_ERR() check works */
+}
+
+/*
+ * Release the block device
+ */
+int nfs4_blkdev_put(struct block_device *bdev)
+{
+	dprintk("%s for device %d:%d\n", __func__, MAJOR(bdev->bd_dev),
+			MINOR(bdev->bd_dev));
+	return blkdev_put(bdev, FMODE_READ);
+}
+
+/*
+ * Shouldn't there be a rpc_generic_upcall() to do this for us?
+ */
+ssize_t bl_pipe_upcall(struct file *filp, struct rpc_pipe_msg *msg,
+		       char __user *dst, size_t buflen)
+{
+	char *data = (char *)msg->data + msg->copied;
+	size_t mlen = min(msg->len - msg->copied, buflen);
+	unsigned long left;
+
+	left = copy_to_user(dst, data, mlen);
+	if (left == mlen) {
+		msg->errno = -EFAULT;
+		return -EFAULT;
+	}
+
+	mlen -= left;
+	msg->copied += mlen;
+	msg->errno = 0;
+	return mlen;
+}
+
+static struct bl_dev_msg bl_mount_reply;
+
+ssize_t bl_pipe_downcall(struct file *filp, const char __user *src,
+			 size_t mlen)
+{
+	if (mlen != sizeof (struct bl_dev_msg))
+		return -EINVAL;
+
+	if (copy_from_user(&bl_mount_reply, src, mlen) != 0)
+		return -EFAULT;
+
+	wake_up(&bl_wq);
+
+	return mlen;
+}
+
+void bl_pipe_destroy_msg(struct rpc_pipe_msg *msg)
+{
+	if (msg->errno >= 0)
+		return;
+	wake_up(&bl_wq);
+}
+
+/*
+ * Decodes pnfs_block_deviceaddr4 which is XDR encoded in dev->dev_addr_buf.
+ */
+struct pnfs_block_dev *
+nfs4_blk_decode_device(struct nfs_server *server,
+		       struct pnfs_device *dev,
+		       struct list_head *sdlist)
+{
+	struct pnfs_block_dev *rv = NULL;
+	struct block_device *bd = NULL;
+	struct rpc_pipe_msg msg;
+	struct bl_msg_hdr bl_msg = {
+		.type = BL_DEVICE_MOUNT,
+		.totallen = dev->mincount,
+	};
+	uint8_t *dataptr;
+	DECLARE_WAITQUEUE(wq, current);
+	struct bl_dev_msg *reply = &bl_mount_reply;
+
+	dprintk("%s CREATING PIPEFS MESSAGE\n", __func__);
+	dprintk("%s: deviceid: %s, mincount: %d\n", __func__, dev->dev_id.data,
+		dev->mincount);
+
+	memset(&msg, 0, sizeof(msg));
+	msg.data = kzalloc(sizeof(bl_msg) + dev->mincount, GFP_NOFS);
+	if (!msg.data) {
+		rv = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+
+	memcpy(msg.data, &bl_msg, sizeof(bl_msg));
+	dataptr = (uint8_t *) msg.data;
+	memcpy(&dataptr[sizeof(bl_msg)], dev->area, dev->mincount);
+	msg.len = sizeof(bl_msg) + dev->mincount;
+
+	dprintk("%s CALLING USERSPACE DAEMON\n", __func__);
+	add_wait_queue(&bl_wq, &wq);
+	if (rpc_queue_upcall(bl_device_pipe->d_inode, &msg) < 0) {
+		remove_wait_queue(&bl_wq, &wq);
+		goto out;
+	}
+
+	set_current_state(TASK_UNINTERRUPTIBLE);
+	schedule();
+	__set_current_state(TASK_RUNNING);
+	remove_wait_queue(&bl_wq, &wq);
+
+	if (reply->status != BL_DEVICE_REQUEST_PROC) {
+		dprintk("%s failed to open device: %d\n",
+			__func__, reply->status);
+		rv = ERR_PTR(-EINVAL);
+		goto out;
+	}
+
+	bd = nfs4_blkdev_get(MKDEV(reply->major, reply->minor));
+	if (IS_ERR(bd)) {
+		dprintk("%s failed to open device : %ld\n",
+			__func__, PTR_ERR(bd));
+		goto out;
+	}
+
+	rv = kzalloc(sizeof(*rv), GFP_NOFS);
+	if (!rv) {
+		rv = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+
+	rv->bm_mdev = bd;
+	memcpy(&rv->bm_mdevid, &dev->dev_id, sizeof(struct nfs4_deviceid));
+	dprintk("%s Created device %s with bd_block_size %u\n",
+		__func__,
+		bd->bd_disk->disk_name,
+		bd->bd_block_size);
+
+out:
+	kfree(msg.data);
+	return rv;
+}
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index de00a37..5833fbb 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -105,7 +105,7 @@ struct rpc_program nfs_program = {
 	.nrvers			= ARRAY_SIZE(nfs_version),
 	.version		= nfs_version,
 	.stats			= &nfs_rpcstat,
-	.pipe_dir_name		= "/nfs",
+	.pipe_dir_name		= NFS_PIPE_DIRNAME,
 };
 
 struct rpc_stat nfs_rpcstat = {
diff --git a/include/linux/nfs.h b/include/linux/nfs.h
index f387919..8c6ee44 100644
--- a/include/linux/nfs.h
+++ b/include/linux/nfs.h
@@ -29,6 +29,8 @@
 #define NFS_MNT_VERSION		1
 #define NFS_MNT3_VERSION	3
 
+#define NFS_PIPE_DIRNAME "/nfs"
+
 /*
  * NFS stats. The good thing with these values is that NFSv3 errors are
  * a superset of NFSv2 errors (with the exception of NFSERR_WFLUSH which
-- 
1.7.4.1
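
Editor's note on bl_pipe_upcall() above: the reader may supply a buffer smaller than the queued message, so each call copies min(remaining, buflen) bytes and advances msg->copied. A minimal userspace sketch of that bookkeeping follows. It assumes memcpy() in place of copy_to_user() (so there is no partial-copy or -EFAULT path), and struct pipe_msg and pipe_msg_read() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model of the rpc_pipe_msg fields the upcall uses. */
struct pipe_msg {
	const char *data;
	size_t len;	/* total message length */
	size_t copied;	/* bytes already handed to the reader */
};

/* Copy the next chunk into dst; returns bytes copied (0 once drained). */
size_t pipe_msg_read(struct pipe_msg *msg, char *dst, size_t buflen)
{
	size_t left = msg->len - msg->copied;
	size_t mlen = left < buflen ? left : buflen;

	memcpy(dst, msg->data + msg->copied, mlen);
	msg->copied += mlen;
	return mlen;
}

/* Drains a 10-byte message through a 4-byte buffer; returns 0 on success. */
int pipe_msg_demo(void)
{
	struct pipe_msg msg = { .data = "0123456789", .len = 10, .copied = 0 };
	char buf[4];

	assert(pipe_msg_read(&msg, buf, sizeof(buf)) == 4);
	assert(memcmp(buf, "0123", 4) == 0);
	assert(pipe_msg_read(&msg, buf, sizeof(buf)) == 4);
	assert(memcmp(buf, "4567", 4) == 0);
	assert(pipe_msg_read(&msg, buf, sizeof(buf)) == 2);
	assert(memcmp(buf, "89", 2) == 0);
	assert(pipe_msg_read(&msg, buf, sizeof(buf)) == 0);
	return 0;
}
```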



* [PATCH v4 14/27] pnfsblock: remove device operations
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (12 preceding siblings ...)
  2011-07-28 17:31 ` [PATCH v4 13/27] pnfsblock: add device operations Jim Rees
@ 2011-07-28 17:31 ` Jim Rees
  2011-07-28 17:31 ` [PATCH v4 15/27] pnfsblock: lseg alloc and free Jim Rees
                   ` (13 subsequent siblings)
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:31 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[upcall bugfixes]
Signed-off-by: Peng Tao <peng_tao@emc.com>
---
 fs/nfs/blocklayout/Makefile        |    2 +-
 fs/nfs/blocklayout/blocklayout.h   |    3 +
 fs/nfs/blocklayout/blocklayoutdm.c |  111 ++++++++++++++++++++++++++++++++++++
 3 files changed, 115 insertions(+), 1 deletions(-)
 create mode 100644 fs/nfs/blocklayout/blocklayoutdm.c

diff --git a/fs/nfs/blocklayout/Makefile b/fs/nfs/blocklayout/Makefile
index 5bf3409..d581550 100644
--- a/fs/nfs/blocklayout/Makefile
+++ b/fs/nfs/blocklayout/Makefile
@@ -2,4 +2,4 @@
 # Makefile for the pNFS block layout driver kernel module
 #
 obj-$(CONFIG_PNFS_BLOCK) += blocklayoutdriver.o
-blocklayoutdriver-objs := blocklayout.o extents.o blocklayoutdev.o
+blocklayoutdriver-objs := blocklayout.o extents.o blocklayoutdev.o blocklayoutdm.o
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 3dcc971..9b88918 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -128,5 +128,8 @@ struct pnfs_block_dev *nfs4_blk_decode_device(struct nfs_server *server,
 int nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
 				struct nfs4_layoutget_res *lgr, gfp_t gfp_flags);
 
+/* blocklayoutdm.c */
+void free_block_dev(struct pnfs_block_dev *bdev);
+
 void bl_put_extent(struct pnfs_block_extent *be);
 #endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
diff --git a/fs/nfs/blocklayout/blocklayoutdm.c b/fs/nfs/blocklayout/blocklayoutdm.c
new file mode 100644
index 0000000..eab95f3
--- /dev/null
+++ b/fs/nfs/blocklayout/blocklayoutdm.c
@@ -0,0 +1,111 @@
+/*
+ *  linux/fs/nfs/blocklayout/blocklayoutdm.c
+ *
+ *  Module for the NFSv4.1 pNFS block layout driver.
+ *
+ *  Copyright (c) 2007 The Regents of the University of Michigan.
+ *  All rights reserved.
+ *
+ *  Fred Isaman <iisaman@umich.edu>
+ *  Andy Adamson <andros@citi.umich.edu>
+ *
+ * permission is granted to use, copy, create derivative works and
+ * redistribute this software and such derivative works for any purpose,
+ * so long as the name of the university of michigan is not used in
+ * any advertising or publicity pertaining to the use or distribution
+ * of this software without specific, written prior authorization.  if
+ * the above copyright notice or any other identification of the
+ * university of michigan is included in any copy of any portion of
+ * this software, then the disclaimer below must also be included.
+ *
+ * this software is provided as is, without representation from the
+ * university of michigan as to its fitness for any purpose, and without
+ * warranty by the university of michigan of any kind, either express
+ * or implied, including without limitation the implied warranties of
+ * merchantability and fitness for a particular purpose.  the regents
+ * of the university of michigan shall not be liable for any damages,
+ * including special, indirect, incidental, or consequential damages,
+ * with respect to any claim arising out or in connection with the use
+ * of the software, even if it has been or is hereafter advised of the
+ * possibility of such damages.
+ */
+
+#include <linux/genhd.h> /* gendisk - used in a dprintk*/
+#include <linux/sched.h>
+#include <linux/hash.h>
+
+#include "blocklayout.h"
+
+#define NFSDBG_FACILITY         NFSDBG_PNFS_LD
+
+static void dev_remove(dev_t dev)
+{
+	struct rpc_pipe_msg msg;
+	struct bl_dev_msg bl_umount_request;
+	struct bl_msg_hdr bl_msg = {
+		.type = BL_DEVICE_UMOUNT,
+		.totallen = sizeof(bl_umount_request),
+	};
+	uint8_t *dataptr;
+	DECLARE_WAITQUEUE(wq, current);
+
+	dprintk("Entering %s\n", __func__);
+
+	memset(&msg, 0, sizeof(msg));
+	msg.data = kzalloc(sizeof(bl_msg) + bl_msg.totallen, GFP_NOFS);
+	if (!msg.data)
+		goto out;
+
+	memset(&bl_umount_request, 0, sizeof(bl_umount_request));
+	bl_umount_request.major = MAJOR(dev);
+	bl_umount_request.minor = MINOR(dev);
+
+	memcpy(msg.data, &bl_msg, sizeof(bl_msg));
+	dataptr = (uint8_t *) msg.data;
+	memcpy(&dataptr[sizeof(bl_msg)], &bl_umount_request, sizeof(bl_umount_request));
+	msg.len = sizeof(bl_msg) + bl_msg.totallen;
+
+	add_wait_queue(&bl_wq, &wq);
+	if (rpc_queue_upcall(bl_device_pipe->d_inode, &msg) < 0) {
+		remove_wait_queue(&bl_wq, &wq);
+		goto out;
+	}
+
+	set_current_state(TASK_UNINTERRUPTIBLE);
+	schedule();
+	__set_current_state(TASK_RUNNING);
+	remove_wait_queue(&bl_wq, &wq);
+
+out:
+	kfree(msg.data);
+}
+
+/*
+ * Release meta device
+ */
+static void nfs4_blk_metadev_release(struct pnfs_block_dev *bdev)
+{
+	int rv;
+
+	dprintk("%s Releasing\n", __func__);
+	rv = nfs4_blkdev_put(bdev->bm_mdev);
+	if (rv)
+		printk(KERN_ERR "%s nfs4_blkdev_put returns %d\n",
+				__func__, rv);
+
+	dev_remove(bdev->bm_mdev->bd_dev);
+}
+
+void free_block_dev(struct pnfs_block_dev *bdev)
+{
+	if (bdev) {
+		if (bdev->bm_mdev) {
+			dprintk("%s Removing DM device: %d:%d\n",
+				__func__,
+				MAJOR(bdev->bm_mdev->bd_dev),
+				MINOR(bdev->bm_mdev->bd_dev));
+			nfs4_blk_metadev_release(bdev);
+		}
+		kfree(bdev);
+	}
+}
-- 
1.7.4.1
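
Editor's note on the upcall message built in dev_remove(): the buffer carries a struct bl_msg_hdr followed by a struct bl_dev_msg, so the allocation has to cover sizeof(header) plus the payload length. The userspace sketch below shows that framing. The two structures are copied from blocklayout.h with fixed-width userspace types; build_umount_msg() and umount_msg_demo() are hypothetical names, and treating totallen as the payload length follows how dev_remove() fills it in, which is an assumption about the intended protocol.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Userspace copies of the message structures from blocklayout.h. */
struct bl_dev_msg {
	int status;
	uint32_t major, minor;
};

struct bl_msg_hdr {
	uint8_t  type;
	uint16_t totallen;	/* here: payload length, as dev_remove() sets it */
};

#define BL_DEVICE_UMOUNT 0x0

/* Build header + payload in one buffer; *len receives the total size. */
void *build_umount_msg(uint32_t major, uint32_t minor, size_t *len)
{
	struct bl_dev_msg req;
	struct bl_msg_hdr hdr;
	uint8_t *buf;

	memset(&req, 0, sizeof(req));
	req.major = major;
	req.minor = minor;

	memset(&hdr, 0, sizeof(hdr));
	hdr.type = BL_DEVICE_UMOUNT;
	hdr.totallen = sizeof(req);

	/* The buffer must hold the whole header, not a single byte of it. */
	*len = sizeof(hdr) + hdr.totallen;
	buf = calloc(1, *len);
	if (!buf)
		return NULL;
	memcpy(buf, &hdr, sizeof(hdr));
	memcpy(buf + sizeof(hdr), &req, sizeof(req));
	return buf;
}

/* Round-trips a request through the buffer; returns 0 on success. */
int umount_msg_demo(void)
{
	size_t len;
	struct bl_dev_msg out;
	uint8_t *buf = build_umount_msg(253, 7, &len);

	if (!buf)
		return -1;
	assert(len == sizeof(struct bl_msg_hdr) + sizeof(struct bl_dev_msg));
	assert(sizeof(struct bl_msg_hdr) > 1);	/* u8 + u16 never fits in one byte */
	memcpy(&out, buf + sizeof(struct bl_msg_hdr), sizeof(out));
	assert(out.major == 253 && out.minor == 7);
	free(buf);
	return 0;
}
```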



* [PATCH v4 15/27] pnfsblock: lseg alloc and free
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (13 preceding siblings ...)
  2011-07-28 17:31 ` [PATCH v4 14/27] pnfsblock: remove " Jim Rees
@ 2011-07-28 17:31 ` Jim Rees
  2011-07-28 17:31 ` [PATCH v4 16/27] pnfsblock: merge extents Jim Rees
                   ` (12 subsequent siblings)
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:31 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: fix bug getting pnfs_layout_type in translate_devid().]
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Zhang Jingwang <Jingwang.Zhang@emc.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
---
 fs/nfs/blocklayout/blocklayout.c    |   31 +++++++++++++++++++++++++------
 fs/nfs/blocklayout/blocklayout.h    |    6 ++++++
 fs/nfs/blocklayout/blocklayoutdev.c |    8 ++++++++
 3 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index c11c105..96c848a 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -115,16 +115,35 @@ static struct pnfs_layout_hdr *bl_alloc_layout_hdr(struct inode *inode,
 	return &bl->bl_layout;
 }
 
-static void
-bl_free_lseg(struct pnfs_layout_segment *lseg)
+static void bl_free_lseg(struct pnfs_layout_segment *lseg)
 {
+	dprintk("%s enter\n", __func__);
+	kfree(lseg);
 }
 
-static struct pnfs_layout_segment *
-bl_alloc_lseg(struct pnfs_layout_hdr *lo,
-	      struct nfs4_layoutget_res *lgr, gfp_t gfp_flags)
+/* We pretty much ignore lseg, and store all data layout wide, so we
+ * can correctly merge.
+ */
+static struct pnfs_layout_segment *bl_alloc_lseg(struct pnfs_layout_hdr *lo,
+						 struct nfs4_layoutget_res *lgr,
+						 gfp_t gfp_flags)
 {
-	return NULL;
+	struct pnfs_layout_segment *lseg;
+	int status;
+
+	dprintk("%s enter\n", __func__);
+	lseg = kzalloc(sizeof(*lseg), gfp_flags);
+	if (!lseg)
+		return ERR_PTR(-ENOMEM);
+	status = nfs4_blk_process_layoutget(lo, lgr, gfp_flags);
+	if (status) {
+		/* We don't want to call the full-blown bl_free_lseg,
+		 * since on error extents were not touched.
+		 */
+		kfree(lseg);
+		return ERR_PTR(status);
+	}
+	return lseg;
 }
 
 static void
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 9b88918..744c7a5 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -96,6 +96,12 @@ static inline struct pnfs_block_layout *BLK_LO2EXT(struct pnfs_layout_hdr *lo)
 	return container_of(lo, struct pnfs_block_layout, bl_layout);
 }
 
+static inline struct pnfs_block_layout *
+BLK_LSEG2EXT(struct pnfs_layout_segment *lseg)
+{
+	return BLK_LO2EXT(lseg->pls_layout);
+}
+
 struct bl_dev_msg {
 	int status;
 	uint32_t major, minor;
diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
index 7e1377f..64da33a 100644
--- a/fs/nfs/blocklayout/blocklayoutdev.c
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -189,3 +189,11 @@ out:
 	kfree(msg.data);
 	return rv;
 }
+
+int
+nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
+			   struct nfs4_layoutget_res *lgr, gfp_t gfp_flags)
+{
+	/* STUB */
+	return -EIO;
+}
-- 
1.7.4.1
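
Editor's note on the error convention used by bl_alloc_lseg(): failures are returned as ERR_PTR(-errno) rather than NULL, and callers are expected to test with IS_ERR(). The sketch below is a userspace approximation of that convention. The ERR_PTR/PTR_ERR/IS_ERR helpers are simplified re-implementations, and struct lseg, process_layoutget(), and lseg_alloc() are hypothetical stand-ins with the status injected for the demo.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Simplified userspace versions of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct lseg {
	int placeholder;
};

/* Stand-in for nfs4_blk_process_layoutget(); the status is injected. */
static int process_layoutget(int status)
{
	return status;
}

/* Same shape as bl_alloc_lseg(): allocate, process, undo on failure. */
struct lseg *lseg_alloc(int status)
{
	struct lseg *lseg = calloc(1, sizeof(*lseg));
	int err;

	if (!lseg)
		return ERR_PTR(-ENOMEM);
	err = process_layoutget(status);
	if (err) {
		free(lseg);
		return ERR_PTR(err);
	}
	return lseg;
}

/* Returns 0 if success and failure are distinguishable via IS_ERR(). */
int lseg_demo(void)
{
	struct lseg *ok = lseg_alloc(0);
	struct lseg *bad = lseg_alloc(-EIO);

	assert(!IS_ERR(ok));
	assert(IS_ERR(bad) && PTR_ERR(bad) == -EIO);
	assert(!IS_ERR(NULL));	/* NULL is not an encoded error */
	free(ok);
	return 0;
}
```

The last assertion is the practical point: a function that reports failure with NULL cannot be checked with IS_ERR(), so a producer and its callers have to agree on one convention.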



* [PATCH v4 16/27] pnfsblock: merge extents
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (14 preceding siblings ...)
  2011-07-28 17:31 ` [PATCH v4 15/27] pnfsblock: lseg alloc and free Jim Rees
@ 2011-07-28 17:31 ` Jim Rees
  2011-07-28 17:31 ` [PATCH v4 17/27] pnfsblock: call and parse getdevicelist Jim Rees
                   ` (11 subsequent siblings)
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:31 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Replace a stub, so that extents underlying the layouts are properly
added, merged, or ignored as necessary.

Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: delete the new node before put it]
Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
---
 fs/nfs/blocklayout/blocklayout.h |   13 +++++
 fs/nfs/blocklayout/extents.c     |  106 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 119 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 744c7a5..4411f77 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -80,6 +80,14 @@ enum extentclass4 {
 	EXTENT_LISTS    = 2,
 };
 
+static inline int choose_list(enum exstate4 state)
+{
+	if (state == PNFS_BLOCK_READ_DATA || state == PNFS_BLOCK_NONE_DATA)
+		return RO_EXTENT;
+	else
+		return RW_EXTENT;
+}
+
 struct pnfs_block_layout {
 	struct pnfs_layout_hdr bl_layout;
 	struct pnfs_inval_markings bl_inval; /* tracks INVAL->RW transition */
@@ -137,5 +145,10 @@ int nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
 /* blocklayoutdm.c */
 void free_block_dev(struct pnfs_block_dev *bdev);
 
+/* extents.c */
 void bl_put_extent(struct pnfs_block_extent *be);
+struct pnfs_block_extent *alloc_extent(void);
+int bl_add_merge_extent(struct pnfs_block_layout *bl,
+			 struct pnfs_block_extent *new);
+
 #endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 44c3364..3591084 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -95,3 +95,109 @@ void print_elist(struct list_head *list)
 	}
 	dprintk("****************\n");
 }
+
+static inline int
+extents_consistent(struct pnfs_block_extent *old, struct pnfs_block_extent *new)
+{
+	/* Note this assumes new->be_f_offset >= old->be_f_offset */
+	return (new->be_state == old->be_state) &&
+		((new->be_state == PNFS_BLOCK_NONE_DATA) ||
+		 ((new->be_v_offset - old->be_v_offset ==
+		   new->be_f_offset - old->be_f_offset) &&
+		  new->be_mdev == old->be_mdev));
+}
+
+/* Adds new to appropriate list in bl, modifying new and removing existing
+ * extents as appropriate to deal with overlaps.
+ *
+ * See bl_find_get_extent for list constraints.
+ *
+ * Refcount on new is already set.  If we end up not using it, or error out,
+ * we need to put the reference.
+ *
+ * bl->bl_ext_lock is held by caller.
+ */
+int
+bl_add_merge_extent(struct pnfs_block_layout *bl,
+		     struct pnfs_block_extent *new)
+{
+	struct pnfs_block_extent *be, *tmp;
+	sector_t end = new->be_f_offset + new->be_length;
+	struct list_head *list;
+
+	dprintk("%s enter with be=%p\n", __func__, new);
+	print_bl_extent(new);
+	list = &bl->bl_extents[choose_list(new->be_state)];
+	print_elist(list);
+
+	/* Scan for proper place to insert, extending new to the left
+	 * as much as possible.
+	 */
+	list_for_each_entry_safe(be, tmp, list, be_node) {
+		if (new->be_f_offset < be->be_f_offset)
+			break;
+		if (end <= be->be_f_offset + be->be_length) {
+			/* new is a subset of existing be */
+			if (extents_consistent(be, new)) {
+				dprintk("%s: new is subset, ignoring\n",
+					__func__);
+				bl_put_extent(new);
+				return 0;
+			} else
+				goto out_err;
+		} else if (new->be_f_offset <=
+				be->be_f_offset + be->be_length) {
+			/* new overlaps or abuts existing be */
+			if (extents_consistent(be, new)) {
+				/* extend new to fully replace be */
+				new->be_length += new->be_f_offset -
+						  be->be_f_offset;
+				new->be_f_offset = be->be_f_offset;
+				new->be_v_offset = be->be_v_offset;
+				dprintk("%s: removing %p\n", __func__, be);
+				list_del(&be->be_node);
+				bl_put_extent(be);
+			} else if (new->be_f_offset !=
+				   be->be_f_offset + be->be_length)
+				goto out_err;
+		}
+	}
+	/* Note that if we never hit the above break, be will not point to a
+	 * valid extent.  However, in that case &be->be_node==list.
+	 */
+	list_add_tail(&new->be_node, &be->be_node);
+	dprintk("%s: inserting new\n", __func__);
+	print_elist(list);
+	/* Scan forward for overlaps.  If we find any, extend new and
+	 * remove the overlapped extent.
+	 */
+	be = list_prepare_entry(new, list, be_node);
+	list_for_each_entry_safe_continue(be, tmp, list, be_node) {
+		if (end < be->be_f_offset)
+			break;
+		/* new overlaps or abuts existing be */
+		if (extents_consistent(be, new)) {
+			if (end < be->be_f_offset + be->be_length) {
+				/* extend new to fully cover be */
+				end = be->be_f_offset + be->be_length;
+				new->be_length = end - new->be_f_offset;
+			}
+			dprintk("%s: removing %p\n", __func__, be);
+			list_del(&be->be_node);
+			bl_put_extent(be);
+		} else if (end != be->be_f_offset) {
+			list_del(&new->be_node);
+			goto out_err;
+		}
+	}
+	dprintk("%s: after merging\n", __func__);
+	print_elist(list);
+	/* FIXME - The per-list consistency checks have all been done,
+	 * should now check cross-list consistency.
+	 */
+	return 0;
+
+ out_err:
+	bl_put_extent(new);
+	return -EIO;
+}
-- 
1.7.4.1



* [PATCH v4 17/27] pnfsblock: call and parse getdevicelist
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (15 preceding siblings ...)
  2011-07-28 17:31 ` [PATCH v4 16/27] pnfsblock: merge extents Jim Rees
@ 2011-07-28 17:31 ` Jim Rees
  2011-07-28 17:31 ` [PATCH v4 18/27] pnfsblock: xdr decode pnfs_block_layout4 Jim Rees
                   ` (10 subsequent siblings)
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:31 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Call GETDEVICELIST during mount, then call and parse GETDEVICEINFO
for each device returned.

[pnfsblock: get rid of deprecated xdr macros]
Signed-off-by: Jim Rees <rees@umich.edu>
[pnfsblock: fix pnfs_deviceid references]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: fix print format warnings for sector_t and size_t]
[pnfs-block: #include <linux/vmalloc.h>]
[pnfsblock: no PNFS_NFS_SERVER]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[pnfsblock: fix bug determining size of striped volume]
[pnfsblock: fix oops when using multiple devices]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[pnfsblock: get rid of vmap and deviceid->area structure]
Signed-off-by: Peng Tao <peng_tao@emc.com>
---
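The blocklayoutdev.c hunk in this patch replaces a single memcpy from
dev->area with a page-by-page copy out of dev->pages.  A minimal userspace
sketch of that gather step follows.  gather_pages() and SKETCH_PAGE_SIZE are
hypothetical names, and the 4096-byte page size is an assumption made only for
the sketch.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, sketch only */

/*
 * Copy len bytes that are spread across consecutive pages into one flat
 * buffer.  Every page except possibly the last contributes a full page.
 * Returns the number of bytes copied.
 */
static size_t gather_pages(unsigned char *dst, unsigned char *const *pages,
			   size_t len)
{
	size_t off = 0;
	size_t i = 0;

	while (off < len) {
		size_t chunk = len - off;

		if (chunk > SKETCH_PAGE_SIZE)
			chunk = SKETCH_PAGE_SIZE;
		memcpy(dst + off, pages[i++], chunk);
		off += chunk;
	}
	return off;
}
```

The caller has to size dst for at least len bytes, just as the kernel code
relies on msg.data being large enough for the header plus dev->mincount.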
 fs/nfs/blocklayout/blocklayout.c    |  138 ++++++++++++++++++++++++++++++++++-
 fs/nfs/blocklayout/blocklayout.h    |   13 +++-
 fs/nfs/blocklayout/blocklayoutdev.c |   13 +++-
 fs/nfs/pnfs.h                       |    1 -
 include/linux/nfs_fs_sb.h           |    1 +
 5 files changed, 158 insertions(+), 8 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 96c848a..507761e 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -158,17 +158,153 @@ bl_cleanup_layoutcommit(struct pnfs_layout_hdr *lo,
 {
 }
 
+static void free_blk_mountid(struct block_mount_id *mid)
+{
+	if (mid) {
+		struct pnfs_block_dev *dev;
+		spin_lock(&mid->bm_lock);
+		while (!list_empty(&mid->bm_devlist)) {
+			dev = list_first_entry(&mid->bm_devlist,
+					       struct pnfs_block_dev,
+					       bm_node);
+			list_del(&dev->bm_node);
+			free_block_dev(dev);
+		}
+		spin_unlock(&mid->bm_lock);
+		kfree(mid);
+	}
+}
+
+/* This is mostly copied from the filelayout's get_device_info function.
+ * It seems much of this should be at the generic pnfs level.
+ */
+static struct pnfs_block_dev *
+nfs4_blk_get_deviceinfo(struct nfs_server *server, const struct nfs_fh *fh,
+			struct nfs4_deviceid *d_id)
+{
+	struct pnfs_device *dev;
+	struct pnfs_block_dev *rv = NULL;
+	u32 max_resp_sz;
+	int max_pages;
+	struct page **pages = NULL;
+	int i, rc;
+
+	/*
+	 * Use the session max response size as the basis for setting
+	 * GETDEVICEINFO's maxcount
+	 */
+	max_resp_sz = server->nfs_client->cl_session->fc_attrs.max_resp_sz;
+	max_pages = max_resp_sz >> PAGE_SHIFT;
+	dprintk("%s max_resp_sz %u max_pages %d\n",
+		__func__, max_resp_sz, max_pages);
+
+	dev = kmalloc(sizeof(*dev), GFP_NOFS);
+	if (!dev) {
+		dprintk("%s kmalloc failed\n", __func__);
+		return NULL;
+	}
+
+	pages = kzalloc(max_pages * sizeof(struct page *), GFP_NOFS);
+	if (pages == NULL) {
+		kfree(dev);
+		return NULL;
+	}
+	for (i = 0; i < max_pages; i++) {
+		pages[i] = alloc_page(GFP_NOFS);
+		if (!pages[i])
+			goto out_free;
+	}
+
+	memcpy(&dev->dev_id, d_id, sizeof(*d_id));
+	dev->layout_type = LAYOUT_BLOCK_VOLUME;
+	dev->pages = pages;
+	dev->pgbase = 0;
+	dev->pglen = PAGE_SIZE * max_pages;
+	dev->mincount = 0;
+
+	dprintk("%s: dev_id: %s\n", __func__, dev->dev_id.data);
+	rc = nfs4_proc_getdeviceinfo(server, dev);
+	dprintk("%s getdevice info returns %d\n", __func__, rc);
+	if (rc)
+		goto out_free;
+
+	rv = nfs4_blk_decode_device(server, dev);
+ out_free:
+	for (i = 0; i < max_pages; i++)
+		__free_page(pages[i]);
+	kfree(pages);
+	kfree(dev);
+	return rv;
+}
+
 static int
 bl_set_layoutdriver(struct nfs_server *server, const struct nfs_fh *fh)
 {
+	struct block_mount_id *b_mt_id = NULL;
+	struct pnfs_devicelist *dlist = NULL;
+	struct pnfs_block_dev *bdev;
+	LIST_HEAD(block_disklist);
+	int status = 0, i;
+
 	dprintk("%s enter\n", __func__);
-	return 0;
+
+	if (server->pnfs_blksize == 0) {
+		dprintk("%s Server did not return blksize\n", __func__);
+		return -EINVAL;
+	}
+	b_mt_id = kzalloc(sizeof(struct block_mount_id), GFP_NOFS);
+	if (!b_mt_id) {
+		status = -ENOMEM;
+		goto out_error;
+	}
+	/* Initialize nfs4 block layout mount id */
+	spin_lock_init(&b_mt_id->bm_lock);
+	INIT_LIST_HEAD(&b_mt_id->bm_devlist);
+
+	dlist = kmalloc(sizeof(struct pnfs_devicelist), GFP_NOFS);
+	if (!dlist) {
+		status = -ENOMEM;
+		goto out_error;
+	}
+	dlist->eof = 0;
+	while (!dlist->eof) {
+		status = nfs4_proc_getdevicelist(server, fh, dlist);
+		if (status)
+			goto out_error;
+		dprintk("%s GETDEVICELIST numdevs=%i, eof=%i\n",
+			__func__, dlist->num_devs, dlist->eof);
+		for (i = 0; i < dlist->num_devs; i++) {
+			bdev = nfs4_blk_get_deviceinfo(server, fh,
+						       &dlist->dev_id[i]);
+			if (!bdev) {
+				status = -ENODEV;
+				goto out_error;
+			}
+			spin_lock(&b_mt_id->bm_lock);
+			list_add(&bdev->bm_node, &b_mt_id->bm_devlist);
+			spin_unlock(&b_mt_id->bm_lock);
+		}
+	}
+	dprintk("%s SUCCESS\n", __func__);
+	server->pnfs_ld_data = b_mt_id;
+
+ out_return:
+	kfree(dlist);
+	return status;
+
+ out_error:
+	free_blk_mountid(b_mt_id);
+	goto out_return;
 }
 
 static int
 bl_clear_layoutdriver(struct nfs_server *server)
 {
+	struct block_mount_id *b_mt_id = server->pnfs_ld_data;
+
 	dprintk("%s enter\n", __func__);
+	free_blk_mountid(b_mt_id);
+	dprintk("%s RETURNS\n", __func__);
 	return 0;
 }
 
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 4411f77..3105b96 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -38,6 +38,11 @@
 
 #include "../pnfs.h"
 
+struct block_mount_id {
+	spinlock_t			bm_lock;    /* protects list */
+	struct list_head		bm_devlist; /* holds pnfs_block_dev */
+};
+
 struct pnfs_block_dev {
 	struct list_head		bm_node;
 	struct nfs4_deviceid		bm_mdevid;    /* associated devid */
@@ -99,7 +104,10 @@ struct pnfs_block_layout {
 	sector_t		bl_blocksize;  /* Server blocksize in sectors */
 };
 
-static inline struct pnfs_block_layout *BLK_LO2EXT(struct pnfs_layout_hdr *lo)
+#define BLK_ID(lo) ((struct block_mount_id *)(NFS_SERVER(lo->plh_inode)->pnfs_ld_data))
+
+static inline struct pnfs_block_layout *
+BLK_LO2EXT(struct pnfs_layout_hdr *lo)
 {
 	return container_of(lo, struct pnfs_block_layout, bl_layout);
 }
@@ -137,8 +145,7 @@ void bl_pipe_destroy_msg(struct rpc_pipe_msg *);
 struct block_device *nfs4_blkdev_get(dev_t dev);
 int nfs4_blkdev_put(struct block_device *bdev);
 struct pnfs_block_dev *nfs4_blk_decode_device(struct nfs_server *server,
-						struct pnfs_device *dev,
-						struct list_head *sdlist);
+						struct pnfs_device *dev);
 int nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
 				struct nfs4_layoutget_res *lgr, gfp_t gfp_flags);
 
diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
index 64da33a..b23fe60 100644
--- a/fs/nfs/blocklayout/blocklayoutdev.c
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -116,8 +116,7 @@ void bl_pipe_destroy_msg(struct rpc_pipe_msg *msg)
  */
 struct pnfs_block_dev *
 nfs4_blk_decode_device(struct nfs_server *server,
-		       struct pnfs_device *dev,
-		       struct list_head *sdlist)
+		       struct pnfs_device *dev)
 {
 	struct pnfs_block_dev *rv = NULL;
 	struct block_device *bd = NULL;
@@ -129,6 +128,7 @@ nfs4_blk_decode_device(struct nfs_server *server,
 	uint8_t *dataptr;
 	DECLARE_WAITQUEUE(wq, current);
 	struct bl_dev_msg *reply = &bl_mount_reply;
+	int offset, len, i;
 
 	dprintk("%s CREATING PIPEFS MESSAGE\n", __func__);
 	dprintk("%s: deviceid: %s, mincount: %d\n", __func__, dev->dev_id.data,
@@ -143,7 +143,14 @@ nfs4_blk_decode_device(struct nfs_server *server,
 
 	memcpy(msg.data, &bl_msg, sizeof(bl_msg));
 	dataptr = (uint8_t *) msg.data;
-	memcpy(&dataptr[sizeof(bl_msg)], dev->area, dev->mincount);
+	len = dev->mincount;
+	offset = sizeof(bl_msg);
+	for (i = 0; len > 0; i++) {
+		memcpy(&dataptr[offset], page_address(dev->pages[i]),
+				len < PAGE_CACHE_SIZE ? len : PAGE_CACHE_SIZE);
+		len -= PAGE_CACHE_SIZE;
+		offset += PAGE_CACHE_SIZE;
+	}
 	msg.len = sizeof(bl_msg) + dev->mincount;
 
 	dprintk("%s CALLING USERSPACE DAEMON\n", __func__);
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index f271425..82dde37 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -141,7 +141,6 @@ struct pnfs_device {
 	unsigned int  layout_type;
 	unsigned int  mincount;
 	struct page **pages;
-	void          *area;
 	unsigned int  pgbase;
 	unsigned int  pglen;
 };
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index b2ea8b8..cc03fc1 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -146,6 +146,7 @@ struct nfs_server {
 	struct pnfs_layoutdriver_type  *pnfs_curr_ld; /* Active layout driver */
 	struct rpc_wait_queue	roc_rpcwaitq;
 	u32			pnfs_blksize;	/* layout_blksize attr */
+	void			*pnfs_ld_data;	/* per mount point data */
 
 	/* the following fields are protected by nfs_client->cl_lock */
 	struct rb_root		state_owners;
-- 
1.7.4.1



* [PATCH v4 18/27] pnfsblock: xdr decode pnfs_block_layout4
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (16 preceding siblings ...)
  2011-07-28 17:31 ` [PATCH v4 17/27] pnfsblock: call and parse getdevicelist Jim Rees
@ 2011-07-28 17:31 ` Jim Rees
  2011-07-28 17:31 ` [PATCH v4 19/27] pnfsblock: bl_find_get_extent Jim Rees
                   ` (9 subsequent siblings)
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:31 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

XDR-decode the block layout payload sent in the LAYOUTGET result, storing
the decoded extents in an extent list.

[pnfsblock: get rid of deprecated xdr macros]
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: fix bug getting pnfs_layout_type in translate_devid().]
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
---
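The offsets and lengths in the layout arrive as 64-bit byte counts and are
stored as 512-byte sector counts, with unaligned values rejected.  The
following userspace sketch illustrates that conversion.  decode_sector() and
SKETCH_SECTOR_SHIFT are hypothetical names, and the sketch reads the
big-endian value by hand rather than through the kernel's XDR helpers.

```c
#include <stdint.h>

#define SKETCH_SECTOR_SHIFT 9	/* 512-byte sectors */

/*
 * Read a big-endian 64-bit byte count from buf and convert it to sectors.
 * Returns 0 on success, -1 if the byte count is not sector-aligned.
 */
static int decode_sector(const unsigned char *buf, uint64_t *sectors)
{
	uint64_t bytes = 0;

	for (int i = 0; i < 8; i++)
		bytes = (bytes << 8) | buf[i];
	if (bytes & ((1u << SKETCH_SECTOR_SHIFT) - 1))
		return -1;	/* low nine bits set: not a multiple of 512 */
	*sectors = bytes >> SKETCH_SECTOR_SHIFT;
	return 0;
}
```

A byte count of 4096 decodes to 8 sectors, while 4097 is refused.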
 fs/nfs/blocklayout/blocklayoutdev.c |  208 ++++++++++++++++++++++++++++++++++-
 1 files changed, 206 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
index b23fe60..3bf8358 100644
--- a/fs/nfs/blocklayout/blocklayoutdev.c
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -40,6 +40,19 @@
 
 #define NFSDBG_FACILITY         NFSDBG_PNFS_LD
 
+static int decode_sector_number(__be32 **rp, sector_t *sp)
+{
+	uint64_t s;
+
+	*rp = xdr_decode_hyper(*rp, &s);
+	if (s & 0x1ff) {
+		printk(KERN_WARNING "%s: sector not aligned\n", __func__);
+		return -1;
+	}
+	*sp = s >> SECTOR_SHIFT;
+	return 0;
+}
+
 /* Open a block_device by device number. */
 struct block_device *nfs4_blkdev_get(dev_t dev)
 {
@@ -197,10 +210,201 @@ out:
 	return rv;
 }
 
+/* Map deviceid returned by the server to constructed block_device */
+static struct block_device *translate_devid(struct pnfs_layout_hdr *lo,
+					    struct nfs4_deviceid *id)
+{
+	struct block_device *rv = NULL;
+	struct block_mount_id *mid;
+	struct pnfs_block_dev *dev;
+
+	dprintk("%s enter, lo=%p, id=%p\n", __func__, lo, id);
+	mid = BLK_ID(lo);
+	spin_lock(&mid->bm_lock);
+	list_for_each_entry(dev, &mid->bm_devlist, bm_node) {
+		if (memcmp(id->data, dev->bm_mdevid.data,
+			   NFS4_DEVICEID4_SIZE) == 0) {
+			rv = dev->bm_mdev;
+			goto out;
+		}
+	}
+ out:
+	spin_unlock(&mid->bm_lock);
+	dprintk("%s returning %p\n", __func__, rv);
+	return rv;
+}
+
+/* Tracks info needed to ensure extents in layout obey constraints of spec */
+struct layout_verification {
+	u32 mode;	/* R or RW */
+	u64 start;	/* Expected start of next non-COW extent */
+	u64 inval;	/* Start of INVAL coverage */
+	u64 cowread;	/* End of COW read coverage */
+};
+
+/* Verify the extent meets the layout requirements of the pnfs-block draft,
+ * section 2.3.1.
+ */
+static int verify_extent(struct pnfs_block_extent *be,
+			 struct layout_verification *lv)
+{
+	if (lv->mode == IOMODE_READ) {
+		if (be->be_state == PNFS_BLOCK_READWRITE_DATA ||
+		    be->be_state == PNFS_BLOCK_INVALID_DATA)
+			return -EIO;
+		if (be->be_f_offset != lv->start)
+			return -EIO;
+		lv->start += be->be_length;
+		return 0;
+	}
+	/* lv->mode == IOMODE_RW */
+	if (be->be_state == PNFS_BLOCK_READWRITE_DATA) {
+		if (be->be_f_offset != lv->start)
+			return -EIO;
+		if (lv->cowread > lv->start)
+			return -EIO;
+		lv->start += be->be_length;
+		lv->inval = lv->start;
+		return 0;
+	} else if (be->be_state == PNFS_BLOCK_INVALID_DATA) {
+		if (be->be_f_offset != lv->start)
+			return -EIO;
+		lv->start += be->be_length;
+		return 0;
+	} else if (be->be_state == PNFS_BLOCK_READ_DATA) {
+		if (be->be_f_offset > lv->start)
+			return -EIO;
+		if (be->be_f_offset < lv->inval)
+			return -EIO;
+		if (be->be_f_offset < lv->cowread)
+			return -EIO;
+		/* It looks like you might want to min this with lv->start,
+		 * but you really don't.
+		 */
+		lv->inval = lv->inval + be->be_length;
+		lv->cowread = be->be_f_offset + be->be_length;
+		return 0;
+	} else
+		return -EIO;
+}
+
+/* XDR decode pnfs_block_layout4 structure */
 int
 nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
 			   struct nfs4_layoutget_res *lgr, gfp_t gfp_flags)
 {
-	/* STUB */
-	return -EIO;
+	struct pnfs_block_layout *bl = BLK_LO2EXT(lo);
+	int i, status = -EIO;
+	uint32_t count;
+	struct pnfs_block_extent *be = NULL, *save;
+	struct xdr_stream stream;
+	struct xdr_buf buf;
+	struct page *scratch;
+	__be32 *p;
+	struct layout_verification lv = {
+		.mode = lgr->range.iomode,
+		.start = lgr->range.offset >> SECTOR_SHIFT,
+		.inval = lgr->range.offset >> SECTOR_SHIFT,
+		.cowread = lgr->range.offset >> SECTOR_SHIFT,
+	};
+	LIST_HEAD(extents);
+
+	dprintk("---> %s\n", __func__);
+
+	scratch = alloc_page(gfp_flags);
+	if (!scratch)
+		return -ENOMEM;
+
+	xdr_init_decode_pages(&stream, &buf, lgr->layoutp->pages, lgr->layoutp->len);
+	xdr_set_scratch_buffer(&stream, page_address(scratch), PAGE_SIZE);
+
+	p = xdr_inline_decode(&stream, 4);
+	if (unlikely(!p))
+		goto out_err;
+
+	count = be32_to_cpup(p++);
+
+	dprintk("%s enter, number of extents %i\n", __func__, count);
+	p = xdr_inline_decode(&stream, (28 + NFS4_DEVICEID4_SIZE) * count);
+	if (unlikely(!p))
+		goto out_err;
+
+	/* Decode individual extents, putting them in temporary
+	 * staging area until whole layout is decoded to make error
+	 * recovery easier.
+	 */
+	for (i = 0; i < count; i++) {
+		be = alloc_extent();
+		if (!be) {
+			status = -ENOMEM;
+			goto out_err;
+		}
+		memcpy(&be->be_devid, p, NFS4_DEVICEID4_SIZE);
+		p += XDR_QUADLEN(NFS4_DEVICEID4_SIZE);
+		be->be_mdev = translate_devid(lo, &be->be_devid);
+		if (!be->be_mdev)
+			goto out_err;
+
+		/* The next three values are read in as bytes,
+		 * but stored as 512-byte sector lengths
+		 */
+		if (decode_sector_number(&p, &be->be_f_offset) < 0)
+			goto out_err;
+		if (decode_sector_number(&p, &be->be_length) < 0)
+			goto out_err;
+		if (decode_sector_number(&p, &be->be_v_offset) < 0)
+			goto out_err;
+		be->be_state = be32_to_cpup(p++);
+		if (be->be_state == PNFS_BLOCK_INVALID_DATA)
+			be->be_inval = &bl->bl_inval;
+		if (verify_extent(be, &lv)) {
+			dprintk("%s verify failed\n", __func__);
+			goto out_err;
+		}
+		list_add_tail(&be->be_node, &extents);
+	}
+	if (lgr->range.offset + lgr->range.length !=
+			lv.start << SECTOR_SHIFT) {
+		dprintk("%s Final length mismatch\n", __func__);
+		be = NULL;
+		goto out_err;
+	}
+	if (lv.start < lv.cowread) {
+		dprintk("%s Final uncovered COW extent\n", __func__);
+		be = NULL;
+		goto out_err;
+	}
+	/* Extents decoded properly, now try to merge them in to
+	 * existing layout extents.
+	 */
+	spin_lock(&bl->bl_ext_lock);
+	list_for_each_entry_safe(be, save, &extents, be_node) {
+		list_del(&be->be_node);
+		status = bl_add_merge_extent(bl, be);
+		if (status) {
+			spin_unlock(&bl->bl_ext_lock);
+			/* This is a fairly catastrophic error, as the
+			 * entire layout extent lists are now corrupted.
+			 * We should have some way to distinguish this.
+			 */
+			be = NULL;
+			goto out_err;
+		}
+	}
+	spin_unlock(&bl->bl_ext_lock);
+	status = 0;
+ out:
+	__free_page(scratch);
+	dprintk("%s returns %i\n", __func__, status);
+	return status;
+
+ out_err:
+	bl_put_extent(be);
+	while (!list_empty(&extents)) {
+		be = list_first_entry(&extents, struct pnfs_block_extent,
+				      be_node);
+		list_del(&be->be_node);
+		bl_put_extent(be);
+	}
+	goto out;
 }
-- 
1.7.4.1



* [PATCH v4 19/27] pnfsblock: bl_find_get_extent
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (17 preceding siblings ...)
  2011-07-28 17:31 ` [PATCH v4 18/27] pnfsblock: xdr decode pnfs_block_layout4 Jim Rees
@ 2011-07-28 17:31 ` Jim Rees
  2011-07-28 17:31 ` [PATCH v4 20/27] pnfsblock: add extent manipulation functions Jim Rees
                   ` (8 subsequent siblings)
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:31 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Implement bl_find_get_extent(), one of the core extent manipulation
routines.

[pnfsblock: Lookup list entry of layouts and tags in reverse order]
Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>

pnfsblock: fix print format warnings for sector_t and size_t

gcc spews warnings about these on x86_64, e.g.:
fs/nfs/blocklayout/blocklayout.c:74: warning: format ‘%Lu’ expects type ‘long long unsigned int’, but argument 2 has type ‘sector_t’
fs/nfs/blocklayout/blocklayout.c:388: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘size_t’

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
---
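bl_find_get_extent() walks each ordered extent list from the tail and stops
as soon as it passes the target sector.  A minimal userspace sketch of that
scan over a sorted, non-overlapping array follows.  struct range and
find_range() are hypothetical names, and the sketch leaves out locking,
refcounts, and the second (COW read) lookup.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the offset and length of an extent. */
struct range {
	uint64_t start;	/* first sector covered */
	uint64_t len;	/* number of sectors covered */
};

/*
 * r[] is sorted by start and non-overlapping.  Scan from the tail and
 * return the index of the range containing sect, or -1 if none does.
 */
static int find_range(const struct range *r, size_t n, uint64_t sect)
{
	for (size_t i = n; i-- > 0; ) {
		if (sect >= r[i].start + r[i].len)
			break;	/* earlier ranges end even sooner */
		if (sect >= r[i].start)
			return (int)i;
	}
	return -1;
}
```

The early break is what the ordering constraint buys: once a range ends at or
before the target sector, nothing earlier in the list can contain it.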
 fs/nfs/blocklayout/blocklayout.h |    3 ++
 fs/nfs/blocklayout/extents.c     |   47 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 3105b96..25c3153 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -153,6 +153,9 @@ int nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
 void free_block_dev(struct pnfs_block_dev *bdev);
 
 /* extents.c */
+struct pnfs_block_extent *
+bl_find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
+		struct pnfs_block_extent **cow_read);
 void bl_put_extent(struct pnfs_block_extent *be);
 struct pnfs_block_extent *alloc_extent(void);
 int bl_add_merge_extent(struct pnfs_block_layout *bl,
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 3591084..c306616 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -201,3 +201,50 @@ bl_add_merge_extent(struct pnfs_block_layout *bl,
 	bl_put_extent(new);
 	return -EIO;
 }
+
+/* Returns extent, or NULL.  If a second READ extent exists, it is returned
+ * in cow_read, if given.
+ *
+ * The extents are kept in two separate ordered lists, one for READ and NONE,
+ * one for READWRITE and INVALID.  Within each list, we assume:
+ * 1. Extents are ordered by file offset.
+ * 2. For any given isect, there is at most one extent that matches.
+ */
+struct pnfs_block_extent *
+bl_find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
+	    struct pnfs_block_extent **cow_read)
+{
+	struct pnfs_block_extent *be, *cow, *ret;
+	int i;
+
+	dprintk("%s enter with isect %llu\n", __func__, (u64)isect);
+	cow = ret = NULL;
+	spin_lock(&bl->bl_ext_lock);
+	for (i = 0; i < EXTENT_LISTS; i++) {
+		list_for_each_entry_reverse(be, &bl->bl_extents[i], be_node) {
+			if (isect >= be->be_f_offset + be->be_length)
+				break;
+			if (isect >= be->be_f_offset) {
+				/* We have found an extent */
+				dprintk("%s Get %p (%i)\n", __func__, be,
+					atomic_read(&be->be_refcnt.refcount));
+				kref_get(&be->be_refcnt);
+				if (!ret)
+					ret = be;
+				else if (be->be_state != PNFS_BLOCK_READ_DATA)
+					bl_put_extent(be);
+				else
+					cow = be;
+				break;
+			}
+		}
+		if (ret &&
+		    (!cow_read || ret->be_state != PNFS_BLOCK_INVALID_DATA))
+			break;
+	}
+	spin_unlock(&bl->bl_ext_lock);
+	if (cow_read)
+		*cow_read = cow;
+	print_bl_extent(ret);
+	return ret;
+}
-- 
1.7.4.1



* [PATCH v4 20/27] pnfsblock: add extent manipulation functions
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (18 preceding siblings ...)
  2011-07-28 17:31 ` [PATCH v4 19/27] pnfsblock: bl_find_get_extent Jim Rees
@ 2011-07-28 17:31 ` Jim Rees
  2011-07-28 17:31 ` [PATCH v4 21/27] pnfsblock: merge rw extents Jim Rees
                   ` (7 subsequent siblings)
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:31 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Adds working implementations of various support functions
to handle INVAL extents, needed by writes, such as
bl_mark_sectors_init and is_sector_initialized.

[pnfsblock: fix 64-bit compiler warnings for extent manipulation]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[Implement release_inval_marks]
Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
---
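The tracking code in this patch keys its entries on sector numbers rounded to
the tree's step size.  A small userspace sketch of that rounding, and of how
many step-sized slots a range touches (the quantity _preload_range() uses to
size its allocation), follows.  The function names are hypothetical, and plain
64-bit arithmetic stands in for do_div().

```c
#include <stdint.h>

/* Largest multiple of base that is <= s, like normalize(). */
static uint64_t sect_round_down(uint64_t s, uint64_t base)
{
	return s - s % base;
}

/* Smallest multiple of base that is >= s, like normalize_up(). */
static uint64_t sect_round_up(uint64_t s, uint64_t base)
{
	return sect_round_down(s + base - 1, base);
}

/* Number of base-sized slots touched by [offset, offset + length). */
static uint64_t sect_slots(uint64_t offset, uint64_t length, uint64_t base)
{
	uint64_t start = sect_round_down(offset, base);
	uint64_t end = sect_round_up(offset + length, base);

	return (end - start) / base;
}
```

For a step size of 8 sectors, a range starting at sector 5 with length 20
touches slots 0, 8, 16, and 24, so four entries are needed.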
 fs/nfs/blocklayout/blocklayout.c |    7 +-
 fs/nfs/blocklayout/blocklayout.h |   31 +++++-
 fs/nfs/blocklayout/extents.c     |  253 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 288 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 507761e..c8db55e 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -78,10 +78,15 @@ release_extents(struct pnfs_block_layout *bl, struct pnfs_layout_range *range)
 	spin_unlock(&bl->bl_ext_lock);
 }
 
-/* STUB */
 static void
 release_inval_marks(struct pnfs_inval_markings *marks)
 {
+	struct pnfs_inval_tracking *pos, *temp;
+
+	list_for_each_entry_safe(pos, temp, &marks->im_tree.mtt_stub, it_link) {
+		list_del(&pos->it_link);
+		kfree(pos);
+	}
 	return;
 }
 
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 25c3153..c002aa2 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -38,6 +38,9 @@
 
 #include "../pnfs.h"
 
+#define PAGE_CACHE_SECTORS (PAGE_CACHE_SIZE >> SECTOR_SHIFT)
+#define PAGE_CACHE_SECTOR_SHIFT (PAGE_CACHE_SHIFT - SECTOR_SHIFT)
+
 struct block_mount_id {
 	spinlock_t			bm_lock;    /* protects list */
 	struct list_head		bm_devlist; /* holds pnfs_block_dev */
@@ -56,8 +59,23 @@ enum exstate4 {
 	PNFS_BLOCK_NONE_DATA		= 3  /* unmapped, it's a hole */
 };
 
+#define MY_MAX_TAGS (15) /* tag bitnums used must be less than this */
+
+struct my_tree {
+	sector_t		mtt_step_size;	/* Internal sector alignment */
+	struct list_head	mtt_stub; /* Should be a radix tree */
+};
+
 struct pnfs_inval_markings {
-	/* STUB */
+	spinlock_t	im_lock;
+	struct my_tree	im_tree;	/* Sectors that need LAYOUTCOMMIT */
+	sector_t	im_block_size;	/* Server blocksize in sectors */
+};
+
+struct pnfs_inval_tracking {
+	struct list_head it_link;
+	int		 it_sector;
+	int		 it_tags;
 };
 
 /* sector_t fields are all in 512-byte sectors */
@@ -76,7 +94,11 @@ struct pnfs_block_extent {
 static inline void
 INIT_INVAL_MARKS(struct pnfs_inval_markings *marks, sector_t blocksize)
 {
-	/* STUB */
+	spin_lock_init(&marks->im_lock);
+	INIT_LIST_HEAD(&marks->im_tree.mtt_stub);
+	marks->im_block_size = blocksize;
+	marks->im_tree.mtt_step_size = min((sector_t)PAGE_CACHE_SECTORS,
+					   blocksize);
 }
 
 enum extentclass4 {
@@ -156,8 +178,13 @@ void free_block_dev(struct pnfs_block_dev *bdev);
 struct pnfs_block_extent *
 bl_find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
 		struct pnfs_block_extent **cow_read);
+int bl_mark_sectors_init(struct pnfs_inval_markings *marks,
+			     sector_t offset, sector_t length,
+			     sector_t **pages);
 void bl_put_extent(struct pnfs_block_extent *be);
 struct pnfs_block_extent *alloc_extent(void);
+struct pnfs_block_extent *get_extent(struct pnfs_block_extent *be);
+int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect);
 int bl_add_merge_extent(struct pnfs_block_layout *bl,
 			 struct pnfs_block_extent *new);
 
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index c306616..3528d36 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -33,6 +33,259 @@
 #include "blocklayout.h"
 #define NFSDBG_FACILITY         NFSDBG_PNFS_LD
 
+/* Bit numbers */
+#define EXTENT_INITIALIZED 0
+#define EXTENT_WRITTEN     1
+#define EXTENT_IN_COMMIT   2
+#define INTERNAL_EXISTS    MY_MAX_TAGS
+#define INTERNAL_MASK      ((1 << INTERNAL_EXISTS) - 1)
+
+/* Returns largest t<=s s.t. t%base==0 */
+static inline sector_t normalize(sector_t s, int base)
+{
+	sector_t tmp = s; /* Since do_div modifies its argument */
+	return s - do_div(tmp, base);
+}
+
+static inline sector_t normalize_up(sector_t s, int base)
+{
+	return normalize(s + base - 1, base);
+}
+
+/* Complete stub using a list until the desired API is determined */
+
+/* Returns tags, or negative */
+static int32_t _find_entry(struct my_tree *tree, u64 s)
+{
+	struct pnfs_inval_tracking *pos;
+
+	dprintk("%s(%llu) enter\n", __func__, s);
+	list_for_each_entry_reverse(pos, &tree->mtt_stub, it_link) {
+		if (pos->it_sector > s)
+			continue;
+		else if (pos->it_sector == s)
+			return pos->it_tags & INTERNAL_MASK;
+		else
+			break;
+	}
+	return -ENOENT;
+}
+
+static inline
+int _has_tag(struct my_tree *tree, u64 s, int32_t tag)
+{
+	int32_t tags;
+
+	dprintk("%s(%llu, %i) enter\n", __func__, s, tag);
+	s = normalize(s, tree->mtt_step_size);
+	tags = _find_entry(tree, s);
+	if ((tags < 0) || !(tags & (1 << tag)))
+		return 0;
+	else
+		return 1;
+}
+
+/* Creates entry with tag, or if entry already exists, unions tag to it.
+ * If storage is not NULL, newly created entry will use it.
+ * Returns number of entries added, or negative on error.
+ */
+static int _add_entry(struct my_tree *tree, u64 s, int32_t tag,
+		      struct pnfs_inval_tracking *storage)
+{
+	int found = 0;
+	struct pnfs_inval_tracking *pos;
+
+	dprintk("%s(%llu, %i, %p) enter\n", __func__, s, tag, storage);
+	list_for_each_entry_reverse(pos, &tree->mtt_stub, it_link) {
+		if (pos->it_sector > s)
+			continue;
+		else if (pos->it_sector == s) {
+			found = 1;
+			break;
+		} else
+			break;
+	}
+	if (found) {
+		pos->it_tags |= (1 << tag);
+		return 0;
+	} else {
+		struct pnfs_inval_tracking *new;
+		if (storage)
+			new = storage;
+		else {
+			new = kmalloc(sizeof(*new), GFP_NOFS);
+			if (!new)
+				return -ENOMEM;
+		}
+		new->it_sector = s;
+		new->it_tags = (1 << tag);
+		list_add(&new->it_link, &pos->it_link);
+		return 1;
+	}
+}
+
+/* XXXX Really want option to not create */
+/* Over range, unions tag with existing entries, else creates entry with tag */
+static int _set_range(struct my_tree *tree, int32_t tag, u64 s, u64 length)
+{
+	u64 i;
+
+	dprintk("%s(%i, %llu, %llu) enter\n", __func__, tag, s, length);
+	for (i = normalize(s, tree->mtt_step_size); i < s + length;
+	     i += tree->mtt_step_size)
+		if (_add_entry(tree, i, tag, NULL))
+			return -ENOMEM;
+	return 0;
+}
+
+/* Ensure that future operations on given range of tree will not malloc */
+static int _preload_range(struct my_tree *tree, u64 offset, u64 length)
+{
+	u64 start, end, s;
+	int count, i, used = 0, status = -ENOMEM;
+	struct pnfs_inval_tracking **storage;
+
+	dprintk("%s(%llu, %llu) enter\n", __func__, offset, length);
+	start = normalize(offset, tree->mtt_step_size);
+	end = normalize_up(offset + length, tree->mtt_step_size);
+	count = (int)(end - start) / (int)tree->mtt_step_size;
+
+	/* Pre-malloc what memory we might need */
+	storage = kmalloc(sizeof(*storage) * count, GFP_NOFS);
+	if (!storage)
+		return -ENOMEM;
+	for (i = 0; i < count; i++) {
+		storage[i] = kmalloc(sizeof(struct pnfs_inval_tracking),
+				     GFP_NOFS);
+		if (!storage[i])
+			goto out_cleanup;
+	}
+
+	/* Now need lock - HOW??? */
+
+	for (s = start; s < end; s += tree->mtt_step_size)
+		used += _add_entry(tree, s, INTERNAL_EXISTS, storage[used]);
+
+	/* Unlock - HOW??? */
+	status = 0;
+
+ out_cleanup:
+	for (i = used; i < count; i++) {
+		if (!storage[i])
+			break;
+		kfree(storage[i]);
+	}
+	kfree(storage);
+	return status;
+}
+
+static void set_needs_init(sector_t *array, sector_t offset)
+{
+	sector_t *p = array;
+
+	dprintk("%s enter\n", __func__);
+	if (!p)
+		return;
+	while (*p < offset)
+		p++;
+	if (*p == offset)
+		return;
+	else if (*p == ~0) {
+		*p++ = offset;
+		*p = ~0;
+		return;
+	} else {
+		sector_t *save = p;
+		dprintk("%s Adding %llu\n", __func__, (u64)offset);
+		while (*p != ~0)
+			p++;
+		p++;
+		memmove(save + 1, save, (char *)p - (char *)save);
+		*save = offset;
+		return;
+	}
+}
+
+/* We are relying on page lock to serialize this */
+int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect)
+{
+	int rv;
+
+	spin_lock(&marks->im_lock);
+	rv = _has_tag(&marks->im_tree, isect, EXTENT_INITIALIZED);
+	spin_unlock(&marks->im_lock);
+	return rv;
+}
+
+/* Marks sectors in [offset, offset+length) as having been initialized.
+ * All lengths are step-aligned, where step is min(pagesize, blocksize).
+ * Notes where partial block is initialized, and helps prepare it for
+ * complete initialization later.
+ */
+/* Currently assumes offset is page-aligned */
+int bl_mark_sectors_init(struct pnfs_inval_markings *marks,
+			     sector_t offset, sector_t length,
+			     sector_t **pages)
+{
+	sector_t s, start, end;
+	sector_t *array = NULL; /* Pages to mark */
+
+	dprintk("%s(offset=%llu,len=%llu) enter\n",
+		__func__, (u64)offset, (u64)length);
+	s = max((sector_t) 3,
+		2 * (marks->im_block_size / (PAGE_CACHE_SECTORS)));
+	dprintk("%s set max=%llu\n", __func__, (u64)s);
+	if (pages) {
+		array = kmalloc(s * sizeof(sector_t), GFP_NOFS);
+		if (!array)
+			goto outerr;
+		array[0] = ~0;
+	}
+
+	start = normalize(offset, marks->im_block_size);
+	end = normalize_up(offset + length, marks->im_block_size);
+	if (_preload_range(&marks->im_tree, start, end - start))
+		goto outerr;
+
+	spin_lock(&marks->im_lock);
+
+	for (s = normalize_up(start, PAGE_CACHE_SECTORS);
+	     s < offset; s += PAGE_CACHE_SECTORS) {
+		dprintk("%s pre-area pages\n", __func__);
+		/* Portion of used block is not initialized */
+		if (!_has_tag(&marks->im_tree, s, EXTENT_INITIALIZED))
+			set_needs_init(array, s);
+	}
+	if (_set_range(&marks->im_tree, EXTENT_INITIALIZED, offset, length))
+		goto out_unlock;
+	for (s = normalize_up(offset + length, PAGE_CACHE_SECTORS);
+	     s < end; s += PAGE_CACHE_SECTORS) {
+		dprintk("%s post-area pages\n", __func__);
+		if (!_has_tag(&marks->im_tree, s, EXTENT_INITIALIZED))
+			set_needs_init(array, s);
+	}
+
+	spin_unlock(&marks->im_lock);
+
+	if (pages) {
+		if (array[0] == ~0) {
+			kfree(array);
+			*pages = NULL;
+		} else
+			*pages = array;
+	}
+	return 0;
+
+ out_unlock:
+	spin_unlock(&marks->im_lock);
+ outerr:
+	if (pages) {
+		kfree(array);
+		*pages = NULL;
+	}
+	return -ENOMEM;
+}
+
 static void print_bl_extent(struct pnfs_block_extent *be)
 {
 	dprintk("PRINT EXTENT extent %p\n", be);
-- 
1.7.4.1



* [PATCH v4 21/27] pnfsblock: merge rw extents
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (19 preceding siblings ...)
  2011-07-28 17:31 ` [PATCH v4 20/27] pnfsblock: add extent manipulation functions Jim Rees
@ 2011-07-28 17:31 ` Jim Rees
  2011-07-28 17:31 ` [PATCH v4 22/27] pnfsblock: encode_layoutcommit Jim Rees
                   ` (6 subsequent siblings)
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:31 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
---
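Reviewer note (not part of the commit): a minimal userspace sketch of the
front-merge rule in _front_merge(), assuming two extents can be merged only
when the earlier one ends exactly where the later one begins and both map
the file the same way. struct ext, consistent() and front_merge() are
hypothetical names for illustration; consistent() is only an assumed
stand-in for extents_consistent(), whose body is not shown in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct pnfs_block_extent. */
struct ext {
	uint64_t f_offset;  /* start in the file, in sectors */
	uint64_t length;    /* size, in sectors */
	uint64_t v_offset;  /* start on the volume, in sectors */
	int state;          /* extent state */
	int dev;            /* device identity, reduced to an int here */
};

/* Assumed consistency rule: same device, same state, and the same
 * file-to-volume displacement.
 */
static bool consistent(const struct ext *a, const struct ext *b)
{
	return a->dev == b->dev && a->state == b->state &&
	       a->v_offset - a->f_offset == b->v_offset - b->f_offset;
}

/* If prev directly precedes be and the two are consistent, writes the
 * merged extent to *out and returns true.  Otherwise leaves *out alone
 * and returns false.
 */
bool front_merge(const struct ext *prev, const struct ext *be,
		 struct ext *out)
{
	assert(prev != NULL && be != NULL && out != NULL);
	if (prev->f_offset + prev->length != be->f_offset ||
	    !consistent(prev, be))
		return false;
	*out = *prev;
	out->length = prev->length + be->length;
	return true;
}
```

The patch itself also swaps list nodes and drops references on the two
old extents; the sketch leaves that out and only shows the merge test and
the resulting range.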
 fs/nfs/blocklayout/extents.c |   47 ++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 3528d36..7beae7c 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -501,3 +501,50 @@ bl_find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
 	print_bl_extent(ret);
 	return ret;
 }
+
+/* Helper function to set_to_rw that initialize a new extent */
+static void
+_prep_new_extent(struct pnfs_block_extent *new,
+		 struct pnfs_block_extent *orig,
+		 sector_t offset, sector_t length, int state)
+{
+	kref_init(&new->be_refcnt);
+	/* don't need to INIT_LIST_HEAD(&new->be_node) */
+	memcpy(&new->be_devid, &orig->be_devid, sizeof(struct nfs4_deviceid));
+	new->be_mdev = orig->be_mdev;
+	new->be_f_offset = offset;
+	new->be_length = length;
+	new->be_v_offset = orig->be_v_offset - orig->be_f_offset + offset;
+	new->be_state = state;
+	new->be_inval = orig->be_inval;
+}
+
+/* Tries to merge be with extent in front of it in list.
+ * Frees storage if not used.
+ */
+static struct pnfs_block_extent *
+_front_merge(struct pnfs_block_extent *be, struct list_head *head,
+	     struct pnfs_block_extent *storage)
+{
+	struct pnfs_block_extent *prev;
+
+	if (!storage)
+		goto no_merge;
+	if (&be->be_node == head || be->be_node.prev == head)
+		goto no_merge;
+	prev = list_entry(be->be_node.prev, struct pnfs_block_extent, be_node);
+	if ((prev->be_f_offset + prev->be_length != be->be_f_offset) ||
+	    !extents_consistent(prev, be))
+		goto no_merge;
+	_prep_new_extent(storage, prev, prev->be_f_offset,
+			 prev->be_length + be->be_length, prev->be_state);
+	list_replace(&prev->be_node, &storage->be_node);
+	bl_put_extent(prev);
+	list_del(&be->be_node);
+	bl_put_extent(be);
+	return storage;
+
+ no_merge:
+	kfree(storage);
+	return be;
+}
-- 
1.7.4.1



* [PATCH v4 22/27] pnfsblock: encode_layoutcommit
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (20 preceding siblings ...)
  2011-07-28 17:31 ` [PATCH v4 21/27] pnfsblock: merge rw extents Jim Rees
@ 2011-07-28 17:31 ` Jim Rees
  2011-07-28 17:31 ` [PATCH v4 23/27] pnfsblock: cleanup_layoutcommit Jim Rees
                   ` (5 subsequent siblings)
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:31 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

In the blocklayout driver, two things happen during
layoutcommit/cleanup:
1. The modified extents are encoded.
2. On cleanup, the extents are put back on the layout rw
   extents list, for reads.

In the new system, where the actual XDR encoding is done in
encode_layoutcommit() directly into the xdr buffer, these are
the new commit stages:

1. On setup_layoutcommit, the range is adjusted as before
   and a structure is allocated for communication with
   bl_encode_layoutcommit and bl_cleanup_layoutcommit.
   (The generic layer provides a void pointer to hang it on.)

2. bl_encode_layoutcommit is called to do the actual
   encoding directly into xdr. The commit-extent-list is not
   freed and is stored on the above structure.
   FIXME: The code is not yet converted to the new XDR cleanup

3. On cleanup, the commit-extent-list is put back by a call
   to set_to_rw() as before, but with no need to XDR-decode
   the list as before. The commit-extent-list is then freed.
   Finally, the allocated structure is freed.

[pnfsblock: get rid of deprecated xdr macros]
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[blocklayout: encode_layoutcommit implementation]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[pnfsblock: fix bug setting up layoutcommit.]
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
[pnfsblock: prevent commit list corruption]
[pnfsblock: fix layoutcommit with an empty opaque]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
---
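Reviewer note (not part of the commit): a minimal userspace sketch of the
wire layout that stage 2 in the commit message produces, assuming one
44-byte entry per commit extent behind a back-filled opaque length and an
extent count. struct short_extent, put_be32(), put_be64() and
encode_layoutupdate() are hypothetical names for illustration and are not
part of this patch; the state value 0 is taken to be the READWRITE_DATA
value from RFC 5663.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEVID_SIZE    16  /* NFS4_DEVICEID4_SIZE */
#define SECTOR_SHIFT  9   /* 512-byte sectors */
#define EXT_RW_DATA   0   /* assumed value of PNFS_BLOCK_READWRITE_DATA */
#define EXT_WIRE_SIZE (DEVID_SIZE + 3 * 8 + 4)  /* 44 bytes per extent */

/* Hypothetical stand-in for struct pnfs_block_short_extent. */
struct short_extent {
	uint8_t  devid[DEVID_SIZE];
	uint64_t f_offset;  /* in sectors */
	uint64_t length;    /* in sectors */
};

static uint8_t *put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
	return p + 4;
}

static uint8_t *put_be64(uint8_t *p, uint64_t v)
{
	p = put_be32(p, (uint32_t)(v >> 32));
	return put_be32(p, (uint32_t)v);
}

/* Returns the total number of bytes written, or 0 if buf is too small. */
size_t encode_layoutupdate(uint8_t *buf, size_t cap,
			   const struct short_extent *ext, uint32_t n)
{
	size_t need = 8 + (size_t)n * EXT_WIRE_SIZE;
	uint8_t *p;
	uint32_t i;

	assert(buf != NULL);
	if (cap < need)
		return 0;
	p = buf + 8;  /* leave room for the two header words */
	for (i = 0; i < n; i++) {
		memcpy(p, ext[i].devid, DEVID_SIZE);
		p += DEVID_SIZE;
		p = put_be64(p, ext[i].f_offset << SECTOR_SHIFT);
		p = put_be64(p, ext[i].length << SECTOR_SHIFT);
		p = put_be64(p, 0);  /* storage offset, always 0 here */
		p = put_be32(p, EXT_RW_DATA);
	}
	/* Back-fill the header: the opaque length excludes its own word. */
	put_be32(buf, (uint32_t)(need - 4));
	put_be32(buf + 4, n);
	return need;
}
```

Unlike the patch, the sketch sizes the whole buffer up front rather than
reserving space extent by extent, so it never emits a partial list.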
 fs/nfs/blocklayout/blocklayout.c |    2 +
 fs/nfs/blocklayout/blocklayout.h |   12 +++
 fs/nfs/blocklayout/extents.c     |  176 ++++++++++++++++++++++++++++----------
 3 files changed, 146 insertions(+), 44 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index c8db55e..e409f63 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -155,6 +155,8 @@ static void
 bl_encode_layoutcommit(struct pnfs_layout_hdr *lo, struct xdr_stream *xdr,
 		       const struct nfs4_layoutcommit_args *arg)
 {
+	dprintk("%s enter\n", __func__);
+	encode_pnfs_block_layoutupdate(BLK_LO2EXT(lo), xdr, arg);
 }
 
 static void
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index c002aa2..de908da 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -91,6 +91,15 @@ struct pnfs_block_extent {
 	struct pnfs_inval_markings *be_inval; /* tracks INVAL->RW transition */
 };
 
+/* Shortened extent used by LAYOUTCOMMIT */
+struct pnfs_block_short_extent {
+	struct list_head bse_node;
+	struct nfs4_deviceid bse_devid;
+	struct block_device *bse_mdev;
+	sector_t	bse_f_offset;	/* the starting offset in the file */
+	sector_t	bse_length;	/* the size of the extent */
+};
+
 static inline void
 INIT_INVAL_MARKS(struct pnfs_inval_markings *marks, sector_t blocksize)
 {
@@ -185,6 +194,9 @@ void bl_put_extent(struct pnfs_block_extent *be);
 struct pnfs_block_extent *alloc_extent(void);
 struct pnfs_block_extent *get_extent(struct pnfs_block_extent *be);
 int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect);
+int encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
+				   struct xdr_stream *xdr,
+				   const struct nfs4_layoutcommit_args *arg);
 int bl_add_merge_extent(struct pnfs_block_layout *bl,
 			 struct pnfs_block_extent *new);
 
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 7beae7c..b46c8be 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -286,6 +286,49 @@ int bl_mark_sectors_init(struct pnfs_inval_markings *marks,
 	return -ENOMEM;
 }
 
+/* Marks sectors in [offset, offset+length) as having been written to disk.
+ * All lengths should be block aligned.
+ */
+int mark_written_sectors(struct pnfs_inval_markings *marks,
+			 sector_t offset, sector_t length)
+{
+	int status;
+
+	dprintk("%s(offset=%llu,len=%llu) enter\n", __func__,
+		(u64)offset, (u64)length);
+	spin_lock(&marks->im_lock);
+	status = _set_range(&marks->im_tree, EXTENT_WRITTEN, offset, length);
+	spin_unlock(&marks->im_lock);
+	return status;
+}
+
+static void print_short_extent(struct pnfs_block_short_extent *be)
+{
+	dprintk("PRINT SHORT EXTENT extent %p\n", be);
+	if (be) {
+		dprintk("        be_f_offset %llu\n", (u64)be->bse_f_offset);
+		dprintk("        be_length   %llu\n", (u64)be->bse_length);
+	}
+}
+
+void print_clist(struct list_head *list, unsigned int count)
+{
+	struct pnfs_block_short_extent *be;
+	unsigned int i = 0;
+
+	ifdebug(FACILITY) {
+		printk(KERN_DEBUG "****************\n");
+		printk(KERN_DEBUG "Extent list looks like:\n");
+		list_for_each_entry(be, list, bse_node) {
+			i++;
+			print_short_extent(be);
+		}
+		if (i != count)
+			printk(KERN_DEBUG "\n\nExpected %u entries\n\n\n", count);
+		printk(KERN_DEBUG "****************\n");
+	}
+}
+
 static void print_bl_extent(struct pnfs_block_extent *be)
 {
 	dprintk("PRINT EXTENT extent %p\n", be);
@@ -386,65 +429,67 @@ bl_add_merge_extent(struct pnfs_block_layout *bl,
 	/* Scan for proper place to insert, extending new to the left
 	 * as much as possible.
 	 */
-	list_for_each_entry_safe(be, tmp, list, be_node) {
-		if (new->be_f_offset < be->be_f_offset)
+	list_for_each_entry_safe_reverse(be, tmp, list, be_node) {
+		if (new->be_f_offset >= be->be_f_offset + be->be_length)
 			break;
-		if (end <= be->be_f_offset + be->be_length) {
-			/* new is a subset of existing be*/
+		if (new->be_f_offset >= be->be_f_offset) {
+			if (end <= be->be_f_offset + be->be_length) {
+				/* new is a subset of existing be*/
+				if (extents_consistent(be, new)) {
+					dprintk("%s: new is subset, ignoring\n",
+						__func__);
+					bl_put_extent(new);
+					return 0;
+				} else {
+					goto out_err;
+				}
+			} else {
+				/* |<--   be   -->|
+				 *          |<--   new   -->| */
+				if (extents_consistent(be, new)) {
+					/* extend new to fully replace be */
+					new->be_length += new->be_f_offset -
+						be->be_f_offset;
+					new->be_f_offset = be->be_f_offset;
+					new->be_v_offset = be->be_v_offset;
+					dprintk("%s: removing %p\n", __func__, be);
+					list_del(&be->be_node);
+					bl_put_extent(be);
+				} else {
+					goto out_err;
+				}
+			}
+		} else if (end >= be->be_f_offset + be->be_length) {
+			/* new extent overlaps existing be */
 			if (extents_consistent(be, new)) {
-				dprintk("%s: new is subset, ignoring\n",
-					__func__);
-				bl_put_extent(new);
-				return 0;
-			} else
+				/* extend new to fully replace be */
+				dprintk("%s: removing %p\n", __func__, be);
+				list_del(&be->be_node);
+				bl_put_extent(be);
+			} else {
 				goto out_err;
-		} else if (new->be_f_offset <=
-				be->be_f_offset + be->be_length) {
-			/* new overlaps or abuts existing be */
-			if (extents_consistent(be, new)) {
+			}
+		} else if (end > be->be_f_offset) {
+			/*           |<--   be   -->|
+			 *|<--   new   -->| */
+			if (extents_consistent(new, be)) {
 				/* extend new to fully replace be */
-				new->be_length += new->be_f_offset -
-						  be->be_f_offset;
-				new->be_f_offset = be->be_f_offset;
-				new->be_v_offset = be->be_v_offset;
+				new->be_length += be->be_f_offset + be->be_length -
+					new->be_f_offset - new->be_length;
 				dprintk("%s: removing %p\n", __func__, be);
 				list_del(&be->be_node);
 				bl_put_extent(be);
-			} else if (new->be_f_offset !=
-				   be->be_f_offset + be->be_length)
+			} else {
 				goto out_err;
+			}
 		}
 	}
 	/* Note that if we never hit the above break, be will not point to a
 	 * valid extent.  However, in that case &be->be_node==list.
 	 */
-	list_add_tail(&new->be_node, &be->be_node);
+	list_add(&new->be_node, &be->be_node);
 	dprintk("%s: inserting new\n", __func__);
 	print_elist(list);
-	/* Scan forward for overlaps.  If we find any, extend new and
-	 * remove the overlapped extent.
-	 */
-	be = list_prepare_entry(new, list, be_node);
-	list_for_each_entry_safe_continue(be, tmp, list, be_node) {
-		if (end < be->be_f_offset)
-			break;
-		/* new overlaps or abuts existing be */
-		if (extents_consistent(be, new)) {
-			if (end < be->be_f_offset + be->be_length) {
-				/* extend new to fully cover be */
-				end = be->be_f_offset + be->be_length;
-				new->be_length = end - new->be_f_offset;
-			}
-			dprintk("%s: removing %p\n", __func__, be);
-			list_del(&be->be_node);
-			bl_put_extent(be);
-		} else if (end != be->be_f_offset) {
-			list_del(&new->be_node);
-			goto out_err;
-		}
-	}
-	dprintk("%s: after merging\n", __func__);
-	print_elist(list);
 	/* FIXME - The per-list consistency checks have all been done,
 	 * should now check cross-list consistency.
 	 */
@@ -502,6 +547,49 @@ bl_find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
 	return ret;
 }
 
+int
+encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
+			       struct xdr_stream *xdr,
+			       const struct nfs4_layoutcommit_args *arg)
+{
+	struct pnfs_block_short_extent *lce, *save;
+	unsigned int count = 0;
+	__be32 *p, *xdr_start;
+
+	dprintk("%s enter\n", __func__);
+	/* BUG - creation of bl_commit is buggy - need to wait for
+	 * entire block to be marked WRITTEN before it can be added.
+	 */
+	spin_lock(&bl->bl_ext_lock);
+	/* Want to adjust for possible truncate */
+	/* We now want to adjust argument range */
+
+	/* XDR encode the ranges found */
+	xdr_start = xdr_reserve_space(xdr, 8);
+	if (!xdr_start)
+		goto out;
+	list_for_each_entry_safe(lce, save, &bl->bl_commit, bse_node) {
+		p = xdr_reserve_space(xdr, 7 * 4 + sizeof(lce->bse_devid.data));
+		if (!p)
+			break;
+		p = xdr_encode_opaque_fixed(p, lce->bse_devid.data, NFS4_DEVICEID4_SIZE);
+		p = xdr_encode_hyper(p, lce->bse_f_offset << SECTOR_SHIFT);
+		p = xdr_encode_hyper(p, lce->bse_length << SECTOR_SHIFT);
+		p = xdr_encode_hyper(p, 0LL);
+		*p++ = cpu_to_be32(PNFS_BLOCK_READWRITE_DATA);
+		list_del(&lce->bse_node);
+		list_add_tail(&lce->bse_node, &bl->bl_committing);
+		bl->bl_count--;
+		count++;
+	}
+	xdr_start[0] = cpu_to_be32((xdr->p - xdr_start - 1) * 4);
+	xdr_start[1] = cpu_to_be32(count);
+out:
+	spin_unlock(&bl->bl_ext_lock);
+	dprintk("%s found %u ranges\n", __func__, count);
+	return 0;
+}
+
 /* Helper function to set_to_rw that initialize a new extent */
 static void
 _prep_new_extent(struct pnfs_block_extent *new,
-- 
1.7.4.1



* [PATCH v4 23/27] pnfsblock: cleanup_layoutcommit
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (21 preceding siblings ...)
  2011-07-28 17:31 ` [PATCH v4 22/27] pnfsblock: encode_layoutcommit Jim Rees
@ 2011-07-28 17:31 ` Jim Rees
  2011-07-28 17:31 ` [PATCH v4 24/27] pnfsblock: bl_read_pagelist Jim Rees
                   ` (4 subsequent siblings)
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:31 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

In the blocklayout driver, two things happen during
layoutcommit/cleanup:
1. The modified extents are encoded.
2. On cleanup, the extents are put back on the layout rw
   extents list, for reads.

In the new system, where the actual XDR encoding is done in
encode_layoutcommit() directly into the xdr buffer, these are
the new commit stages:

1. On setup_layoutcommit, the range is adjusted as before
   and a structure is allocated for communication with
   bl_encode_layoutcommit and bl_cleanup_layoutcommit.
   (The generic layer provides a void pointer to hang it on.)

2. bl_encode_layoutcommit is called to do the actual
   encoding directly into xdr. The commit-extent-list is not
   freed and is stored on the above structure.
   FIXME: The code is not yet converted to the new XDR cleanup

3. On cleanup, the commit-extent-list is put back by a call
   to set_to_rw() as before, but with no need to XDR-decode
   the list as before. The commit-extent-list is then freed.
   Finally, the allocated structure is freed.

[pnfsblock: introduce bl_committing list]
Signed-off-by: Peng Tao <peng_tao@emc.com>
[pnfsblock: SQUASHME: adjust to API change]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[blocklayout: encode_layoutcommit implementation]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[pnfsblock: fix bug setting up layoutcommit.]
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
[pnfsblock: cleanup_layoutcommit wants a status parameter]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
---
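Reviewer note (not part of the commit): a minimal userspace sketch of the
range arithmetic behind set_to_rw() in stage 3, assuming an INVALID extent
is split into at most three pieces around the committed range. struct
piece, mk_piece() and split_for_rw() are hypothetical names for
illustration; list handling, reference counting and the merge with
neighbouring extents are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum ext_state { EXT_RW, EXT_INVAL };

/* Hypothetical result record: one piece of the split extent. */
struct piece {
	uint64_t start;
	uint64_t len;
	enum ext_state state;
};

static struct piece mk_piece(uint64_t start, uint64_t len,
			     enum ext_state state)
{
	struct piece p;

	p.start = start;
	p.len = len;
	p.state = state;
	return p;
}

/* Splits the extent [be_start, be_start + be_len) around the committed
 * range that begins at offset.  Writes up to three pieces to out[] and
 * returns how many were written.  *next receives the end of the extent,
 * which is where a caller would resume if the range extends past it.
 */
int split_for_rw(uint64_t be_start, uint64_t be_len,
		 uint64_t offset, uint64_t length,
		 struct piece out[3], uint64_t *next)
{
	uint64_t be_end = be_start + be_len;
	uint64_t rw_len;
	int n = 0;

	assert(offset >= be_start && offset < be_end);
	rw_len = length < be_end - offset ? length : be_end - offset;

	if (offset != be_start)  /* leading part stays INVALID */
		out[n++] = mk_piece(be_start, offset - be_start, EXT_INVAL);
	out[n++] = mk_piece(offset, rw_len, EXT_RW);
	if (offset + length < be_end)  /* trailing part stays INVALID */
		out[n++] = mk_piece(offset + rw_len,
				    be_end - offset - length, EXT_INVAL);
	*next = be_end;
	return n;
}
```

A caller in the style of clean_pnfs_block_layoutupdate() would keep
calling this with offset set to *next until the whole committed range has
been covered.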
 fs/nfs/blocklayout/blocklayout.c |    2 +
 fs/nfs/blocklayout/blocklayout.h |    3 +
 fs/nfs/blocklayout/extents.c     |  210 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 215 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index e409f63..65f885d 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -163,6 +163,8 @@ static void
 bl_cleanup_layoutcommit(struct pnfs_layout_hdr *lo,
 			struct nfs4_layoutcommit_data *lcdata)
 {
+	dprintk("%s enter\n", __func__);
+	clean_pnfs_block_layoutupdate(BLK_LO2EXT(lo), &lcdata->args, lcdata->res.status);
 }
 
 static void free_blk_mountid(struct block_mount_id *mid)
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index de908da..79f564d 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -197,6 +197,9 @@ int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect);
 int encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
 				   struct xdr_stream *xdr,
 				   const struct nfs4_layoutcommit_args *arg);
+void clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
+				   const struct nfs4_layoutcommit_args *arg,
+				   int status);
 int bl_add_merge_extent(struct pnfs_block_layout *bl,
 			 struct pnfs_block_extent *new);
 
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index b46c8be..a9224a1 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -329,6 +329,73 @@ void print_clist(struct list_head *list, unsigned int count)
 	}
 }
 
+/* Note: In theory, we should do more checking that devid's match between
+ * old and new, but if they don't, the lists are too corrupt to salvage anyway.
+ */
+/* Note this is very similar to bl_add_merge_extent */
+static void add_to_commitlist(struct pnfs_block_layout *bl,
+			      struct pnfs_block_short_extent *new)
+{
+	struct list_head *clist = &bl->bl_commit;
+	struct pnfs_block_short_extent *old, *save;
+	sector_t end = new->bse_f_offset + new->bse_length;
+
+	dprintk("%s enter\n", __func__);
+	print_short_extent(new);
+	print_clist(clist, bl->bl_count);
+	bl->bl_count++;
+	/* Scan for proper place to insert, extending new to the left
+	 * as much as possible.
+	 */
+	list_for_each_entry_safe(old, save, clist, bse_node) {
+		if (new->bse_f_offset < old->bse_f_offset)
+			break;
+		if (end <= old->bse_f_offset + old->bse_length) {
+			/* Range is already in list */
+			bl->bl_count--;
+			kfree(new);
+			return;
+		} else if (new->bse_f_offset <=
+				old->bse_f_offset + old->bse_length) {
+			/* new overlaps or abuts existing be */
+			if (new->bse_mdev == old->bse_mdev) {
+				/* extend new to fully replace old */
+				new->bse_length += new->bse_f_offset -
+						old->bse_f_offset;
+				new->bse_f_offset = old->bse_f_offset;
+				list_del(&old->bse_node);
+				bl->bl_count--;
+				kfree(old);
+			}
+		}
+	}
+	/* Note that if we never hit the above break, old will not point to a
+	 * valid extent.  However, in that case &old->bse_node==list.
+	 */
+	list_add_tail(&new->bse_node, &old->bse_node);
+	/* Scan forward for overlaps.  If we find any, extend new and
+	 * remove the overlapped extent.
+	 */
+	old = list_prepare_entry(new, clist, bse_node);
+	list_for_each_entry_safe_continue(old, save, clist, bse_node) {
+		if (end < old->bse_f_offset)
+			break;
+		/* new overlaps or abuts old */
+		if (new->bse_mdev == old->bse_mdev) {
+			if (end < old->bse_f_offset + old->bse_length) {
+				/* extend new to fully cover old */
+				end = old->bse_f_offset + old->bse_length;
+				new->bse_length = end - new->bse_f_offset;
+			}
+			list_del(&old->bse_node);
+			bl->bl_count--;
+			kfree(old);
+		}
+	}
+	dprintk("%s: after merging\n", __func__);
+	print_clist(clist, bl->bl_count);
+}
+
 static void print_bl_extent(struct pnfs_block_extent *be)
 {
 	dprintk("PRINT EXTENT extent %p\n", be);
@@ -547,6 +614,34 @@ bl_find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
 	return ret;
 }
 
+/* Similar to bl_find_get_extent, but called with lock held, and ignores cow */
+static struct pnfs_block_extent *
+bl_find_get_extent_locked(struct pnfs_block_layout *bl, sector_t isect)
+{
+	struct pnfs_block_extent *be, *ret = NULL;
+	int i;
+
+	dprintk("%s enter with isect %llu\n", __func__, (u64)isect);
+	for (i = 0; i < EXTENT_LISTS; i++) {
+		if (ret)
+			break;
+		list_for_each_entry_reverse(be, &bl->bl_extents[i], be_node) {
+			if (isect >= be->be_f_offset + be->be_length)
+				break;
+			if (isect >= be->be_f_offset) {
+				/* We have found an extent */
+				dprintk("%s Get %p (%i)\n", __func__, be,
+					atomic_read(&be->be_refcnt.refcount));
+				kref_get(&be->be_refcnt);
+				ret = be;
+				break;
+			}
+		}
+	}
+	print_bl_extent(ret);
+	return ret;
+}
+
 int
 encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
 			       struct xdr_stream *xdr,
@@ -636,3 +731,118 @@ _front_merge(struct pnfs_block_extent *be, struct list_head *head,
 	kfree(storage);
 	return be;
 }
+
+static u64
+set_to_rw(struct pnfs_block_layout *bl, u64 offset, u64 length)
+{
+	u64 rv = offset + length;
+	struct pnfs_block_extent *be, *e1, *e2, *e3, *new, *old;
+	struct pnfs_block_extent *children[3];
+	struct pnfs_block_extent *merge1 = NULL, *merge2 = NULL;
+	int i = 0, j;
+
+	dprintk("%s(%llu, %llu)\n", __func__, offset, length);
+	/* Create storage for up to three new extents e1, e2, e3 */
+	e1 = kmalloc(sizeof(*e1), GFP_ATOMIC);
+	e2 = kmalloc(sizeof(*e2), GFP_ATOMIC);
+	e3 = kmalloc(sizeof(*e3), GFP_ATOMIC);
+	/* BUG - we are ignoring any failure */
+	if (!e1 || !e2 || !e3)
+		goto out_nosplit;
+
+	spin_lock(&bl->bl_ext_lock);
+	be = bl_find_get_extent_locked(bl, offset);
+	rv = be->be_f_offset + be->be_length;
+	if (be->be_state != PNFS_BLOCK_INVALID_DATA) {
+		spin_unlock(&bl->bl_ext_lock);
+		goto out_nosplit;
+	}
+	/* Add e* to children, bumping e*'s krefs */
+	if (be->be_f_offset != offset) {
+		_prep_new_extent(e1, be, be->be_f_offset,
+				 offset - be->be_f_offset,
+				 PNFS_BLOCK_INVALID_DATA);
+		children[i++] = e1;
+		print_bl_extent(e1);
+	} else
+		merge1 = e1;
+	_prep_new_extent(e2, be, offset,
+			 min(length, be->be_f_offset + be->be_length - offset),
+			 PNFS_BLOCK_READWRITE_DATA);
+	children[i++] = e2;
+	print_bl_extent(e2);
+	if (offset + length < be->be_f_offset + be->be_length) {
+		_prep_new_extent(e3, be, e2->be_f_offset + e2->be_length,
+				 be->be_f_offset + be->be_length -
+				 offset - length,
+				 PNFS_BLOCK_INVALID_DATA);
+		children[i++] = e3;
+		print_bl_extent(e3);
+	} else
+		merge2 = e3;
+
+	/* Remove be from list, and insert the e* */
+	/* We don't get refs on e*, since this list is the base reference
+	 * set when init'ed.
+	 */
+	if (i < 3)
+		children[i] = NULL;
+	new = children[0];
+	list_replace(&be->be_node, &new->be_node);
+	bl_put_extent(be);
+	new = _front_merge(new, &bl->bl_extents[RW_EXTENT], merge1);
+	for (j = 1; j < i; j++) {
+		old = new;
+		new = children[j];
+		list_add(&new->be_node, &old->be_node);
+	}
+	if (merge2) {
+		/* This is a HACK, should just create a _back_merge function */
+		new = list_entry(new->be_node.next,
+				 struct pnfs_block_extent, be_node);
+		new = _front_merge(new, &bl->bl_extents[RW_EXTENT], merge2);
+	}
+	spin_unlock(&bl->bl_ext_lock);
+
+	/* Since we removed the base reference above, be is now scheduled for
+	 * destruction.
+	 */
+	bl_put_extent(be);
+	dprintk("%s returns %llu after split\n", __func__, rv);
+	return rv;
+
+ out_nosplit:
+	kfree(e1);
+	kfree(e2);
+	kfree(e3);
+	dprintk("%s returns %llu without splitting\n", __func__, rv);
+	return rv;
+}
+
+void
+clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
+			      const struct nfs4_layoutcommit_args *arg,
+			      int status)
+{
+	struct pnfs_block_short_extent *lce, *save;
+
+	dprintk("%s status %d\n", __func__, status);
+	list_for_each_entry_safe(lce, save, &bl->bl_committing, bse_node) {
+		if (likely(!status)) {
+			u64 offset = lce->bse_f_offset;
+			u64 end = offset + lce->bse_length;
+
+			do {
+				offset = set_to_rw(bl, offset, end - offset);
+			} while (offset < end);
+			list_del(&lce->bse_node);
+
+			kfree(lce);
+		} else {
+			list_del(&lce->bse_node);
+			spin_lock(&bl->bl_ext_lock);
+			add_to_commitlist(bl, lce);
+			spin_unlock(&bl->bl_ext_lock);
+		}
+	}
+}
-- 
1.7.4.1



* [PATCH v4 24/27] pnfsblock: bl_read_pagelist
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (22 preceding siblings ...)
  2011-07-28 17:31 ` [PATCH v4 23/27] pnfsblock: cleanup_layoutcommit Jim Rees
@ 2011-07-28 17:31 ` Jim Rees
  2011-07-28 17:31 ` [PATCH v4 25/27] pnfsblock: bl_write_pagelist Jim Rees
                   ` (3 subsequent siblings)
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:31 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Note: When the upper layer's read/write request cannot be fulfilled, the
block layout driver shouldn't silently mark the page as being in error.
It should do what can be done and leave the rest to the upper layer. To
do so, we should set rdata/wdata->res.count properly.

When the upper layer re-sends the read/write request to finish the rest
of the request, pgbase is the position where we should start.

[pnfsblock: mark IO error with NFS_LAYOUT_{RW|RO}_FAILED]
Signed-off-by: Peng Tao <peng_tao@emc.com>
[pnfsblock: read path error handling]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: handle errors when read or write pagelist.]
Signed-off-by: Zhang Jingwang <yyalone@gmail.com>
[pnfs-block: use new read_pagelist api]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
---
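Reviewer note (not part of the commit): a minimal userspace sketch of the
partial-completion bookkeeping described in the commit message, assuming
the driver reports how many bytes it handled and the upper layer re-sends
the remainder with offset and pgbase advanced by that amount. struct
io_req, completed_count() and remainder_req() are hypothetical names for
illustration and do not mirror the real rdata/wdata structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced view of a read/write request. */
struct io_req {
	uint64_t offset;  /* file offset in bytes */
	uint32_t count;   /* bytes requested */
	uint32_t pgbase;  /* start position within the page list, in bytes */
};

/* Bytes the driver could handle, given the first file offset it cannot
 * serve.  This is the value that would be reported as res.count.
 */
uint32_t completed_count(const struct io_req *req, uint64_t first_bad)
{
	uint64_t avail;

	assert(req != NULL);
	if (first_bad <= req->offset)
		return 0;
	avail = first_bad - req->offset;
	return avail < req->count ? (uint32_t)avail : req->count;
}

/* Builds the follow-up request for whatever was not completed.
 * Returns 1 if *rest was filled in, 0 if nothing remains.
 */
int remainder_req(const struct io_req *req, uint32_t done,
		  struct io_req *rest)
{
	assert(req != NULL && rest != NULL);
	if (done >= req->count)
		return 0;
	rest->offset = req->offset + done;
	rest->count = req->count - done;
	rest->pgbase = req->pgbase + done;
	return 1;
}
```

The point of the sketch is only that a short count plus an advanced
pgbase is enough for the upper layer to retry; no page has to be flagged
as failed for the part that was never attempted.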
 fs/nfs/blocklayout/blocklayout.c |  265 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 265 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 65f885d..aecd73b 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -29,10 +29,12 @@
  * of the software, even if it has been or is hereafter advised of the
  * possibility of such damages.
  */
+
 #include <linux/module.h>
 #include <linux/init.h>
 #include <linux/mount.h>
 #include <linux/namei.h>
+#include <linux/bio.h>		/* struct bio */
 
 #include "blocklayout.h"
 
@@ -45,9 +47,272 @@ MODULE_DESCRIPTION("The NFSv4.1 pNFS Block layout driver");
 struct dentry *bl_device_pipe;
 wait_queue_head_t bl_wq;
 
+static void print_page(struct page *page)
+{
+	dprintk("PRINTPAGE page %p\n", page);
+	dprintk("	PagePrivate %d\n", PagePrivate(page));
+	dprintk("	PageUptodate %d\n", PageUptodate(page));
+	dprintk("	PageError %d\n", PageError(page));
+	dprintk("	PageDirty %d\n", PageDirty(page));
+	dprintk("	PageReferenced %d\n", PageReferenced(page));
+	dprintk("	PageLocked %d\n", PageLocked(page));
+	dprintk("	PageWriteback %d\n", PageWriteback(page));
+	dprintk("	PageMappedToDisk %d\n", PageMappedToDisk(page));
+	dprintk("\n");
+}
+
+/* Given the be associated with isect, determine if page data needs to be
+ * initialized.
+ */
+static int is_hole(struct pnfs_block_extent *be, sector_t isect)
+{
+	if (be->be_state == PNFS_BLOCK_NONE_DATA)
+		return 1;
+	else if (be->be_state != PNFS_BLOCK_INVALID_DATA)
+		return 0;
+	else
+		return !is_sector_initialized(be->be_inval, isect);
+}
+
+/* The data we are handed might be spread across several bios.  We need
+ * to track when the last one is finished.
+ */
+struct parallel_io {
+	struct kref refcnt;
+	struct rpc_call_ops call_ops;
+	void (*pnfs_callback) (void *data);
+	void *data;
+};
+
+static inline struct parallel_io *alloc_parallel(void *data)
+{
+	struct parallel_io *rv;
+
+	rv  = kmalloc(sizeof(*rv), GFP_NOFS);
+	if (rv) {
+		rv->data = data;
+		kref_init(&rv->refcnt);
+	}
+	return rv;
+}
+
+static inline void get_parallel(struct parallel_io *p)
+{
+	kref_get(&p->refcnt);
+}
+
+static void destroy_parallel(struct kref *kref)
+{
+	struct parallel_io *p = container_of(kref, struct parallel_io, refcnt);
+
+	dprintk("%s enter\n", __func__);
+	p->pnfs_callback(p->data);
+	kfree(p);
+}
+
+static inline void put_parallel(struct parallel_io *p)
+{
+	kref_put(&p->refcnt, destroy_parallel);
+}
+
+static struct bio *
+bl_submit_bio(int rw, struct bio *bio)
+{
+	if (bio) {
+		get_parallel(bio->bi_private);
+		dprintk("%s submitting %s bio %u@%llu\n", __func__,
+			rw == READ ? "read" : "write",
+			bio->bi_size, (unsigned long long)bio->bi_sector);
+		submit_bio(rw, bio);
+	}
+	return NULL;
+}
+
+static struct bio *bl_alloc_init_bio(int npg, sector_t isect,
+				     struct pnfs_block_extent *be,
+				     void (*end_io)(struct bio *, int err),
+				     struct parallel_io *par)
+{
+	struct bio *bio;
+
+	bio = bio_alloc(GFP_NOIO, npg);
+	if (!bio)
+		return NULL;
+
+	bio->bi_sector = isect - be->be_f_offset + be->be_v_offset;
+	bio->bi_bdev = be->be_mdev;
+	bio->bi_end_io = end_io;
+	bio->bi_private = par;
+	return bio;
+}
+
+static struct bio *bl_add_page_to_bio(struct bio *bio, int npg, int rw,
+				      sector_t isect, struct page *page,
+				      struct pnfs_block_extent *be,
+				      void (*end_io)(struct bio *, int err),
+				      struct parallel_io *par)
+{
+retry:
+	if (!bio) {
+		bio = bl_alloc_init_bio(npg, isect, be, end_io, par);
+		if (!bio)
+			return ERR_PTR(-ENOMEM);
+	}
+	if (bio_add_page(bio, page, PAGE_CACHE_SIZE, 0) < PAGE_CACHE_SIZE) {
+		bio = bl_submit_bio(rw, bio);
+		goto retry;
+	}
+	return bio;
+}
+
+static void bl_set_lo_fail(struct pnfs_layout_segment *lseg)
+{
+	if (lseg->pls_range.iomode == IOMODE_RW) {
+		dprintk("%s Setting layout IOMODE_RW fail bit\n", __func__);
+		set_bit(lo_fail_bit(IOMODE_RW), &lseg->pls_layout->plh_flags);
+	} else {
+		dprintk("%s Setting layout IOMODE_READ fail bit\n", __func__);
+		set_bit(lo_fail_bit(IOMODE_READ), &lseg->pls_layout->plh_flags);
+	}
+}
+
+/* This is basically copied from mpage_end_io_read */
+static void bl_end_io_read(struct bio *bio, int err)
+{
+	struct parallel_io *par = bio->bi_private;
+	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
+	struct nfs_read_data *rdata = (struct nfs_read_data *)par->data;
+
+	do {
+		struct page *page = bvec->bv_page;
+
+		if (--bvec >= bio->bi_io_vec)
+			prefetchw(&bvec->bv_page->flags);
+		if (uptodate)
+			SetPageUptodate(page);
+	} while (bvec >= bio->bi_io_vec);
+	if (!uptodate) {
+		if (!rdata->pnfs_error)
+			rdata->pnfs_error = -EIO;
+		bl_set_lo_fail(rdata->lseg);
+	}
+	bio_put(bio);
+	put_parallel(par);
+}
+
+static void bl_read_cleanup(struct work_struct *work)
+{
+	struct rpc_task *task;
+	struct nfs_read_data *rdata;
+	dprintk("%s enter\n", __func__);
+	task = container_of(work, struct rpc_task, u.tk_work);
+	rdata = container_of(task, struct nfs_read_data, task);
+	pnfs_ld_read_done(rdata);
+}
+
+static void
+bl_end_par_io_read(void *data)
+{
+	struct nfs_read_data *rdata = data;
+
+	INIT_WORK(&rdata->task.u.tk_work, bl_read_cleanup);
+	schedule_work(&rdata->task.u.tk_work);
+}
+
+/* We don't want normal .rpc_call_done callback used, so we replace it
+ * with this stub.
+ */
+static void bl_rpc_do_nothing(struct rpc_task *task, void *calldata)
+{
+	return;
+}
+
 static enum pnfs_try_status
 bl_read_pagelist(struct nfs_read_data *rdata)
 {
+	int i, hole;
+	struct bio *bio = NULL;
+	struct pnfs_block_extent *be = NULL, *cow_read = NULL;
+	sector_t isect, extent_length = 0;
+	struct parallel_io *par;
+	loff_t f_offset = rdata->args.offset;
+	size_t count = rdata->args.count;
+	struct page **pages = rdata->args.pages;
+	int pg_index = rdata->args.pgbase >> PAGE_CACHE_SHIFT;
+
+	dprintk("%s enter nr_pages %u offset %lld count %Zd\n", __func__,
+	       rdata->npages, f_offset, count);
+
+	par = alloc_parallel(rdata);
+	if (!par)
+		goto use_mds;
+	par->call_ops = *rdata->mds_ops;
+	par->call_ops.rpc_call_done = bl_rpc_do_nothing;
+	par->pnfs_callback = bl_end_par_io_read;
+	/* At this point, we can no longer jump to use_mds */
+
+	isect = (sector_t) (f_offset >> SECTOR_SHIFT);
+	/* Code assumes extents are page-aligned */
+	for (i = pg_index; i < rdata->npages; i++) {
+		if (!extent_length) {
+			/* We've used up the previous extent */
+			bl_put_extent(be);
+			bl_put_extent(cow_read);
+			bio = bl_submit_bio(READ, bio);
+			/* Get the next one */
+			be = bl_find_get_extent(BLK_LSEG2EXT(rdata->lseg),
+					     isect, &cow_read);
+			if (!be) {
+				rdata->pnfs_error = -EIO;
+				goto out;
+			}
+			extent_length = be->be_length -
+				(isect - be->be_f_offset);
+			if (cow_read) {
+				sector_t cow_length = cow_read->be_length -
+					(isect - cow_read->be_f_offset);
+				extent_length = min(extent_length, cow_length);
+			}
+		}
+		hole = is_hole(be, isect);
+		if (hole && !cow_read) {
+			bio = bl_submit_bio(READ, bio);
+			/* Fill hole w/ zeroes w/o accessing device */
+			dprintk("%s Zeroing page for hole\n", __func__);
+			zero_user_segment(pages[i], 0, PAGE_CACHE_SIZE);
+			print_page(pages[i]);
+			SetPageUptodate(pages[i]);
+		} else {
+			struct pnfs_block_extent *be_read;
+
+			be_read = (hole && cow_read) ? cow_read : be;
+			bio = bl_add_page_to_bio(bio, rdata->npages - i, READ,
+						 isect, pages[i], be_read,
+						 bl_end_io_read, par);
+			if (IS_ERR(bio)) {
+				rdata->pnfs_error = PTR_ERR(bio);
+				goto out;
+			}
+		}
+		isect += PAGE_CACHE_SECTORS;
+		extent_length -= PAGE_CACHE_SECTORS;
+	}
+	if ((isect << SECTOR_SHIFT) >= rdata->inode->i_size) {
+		rdata->res.eof = 1;
+		rdata->res.count = rdata->inode->i_size - f_offset;
+	} else {
+		rdata->res.count = (isect << SECTOR_SHIFT) - f_offset;
+	}
+out:
+	bl_put_extent(be);
+	bl_put_extent(cow_read);
+	bl_submit_bio(READ, bio);
+	put_parallel(par);
+	return PNFS_ATTEMPTED;
+
+ use_mds:
+	dprintk("Giving up and using normal NFS\n");
 	return PNFS_NOT_ATTEMPTED;
 }
 
-- 
1.7.4.1



* [PATCH v4 25/27] pnfsblock: bl_write_pagelist
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (23 preceding siblings ...)
  2011-07-28 17:31 ` [PATCH v4 24/27] pnfsblock: bl_read_pagelist Jim Rees
@ 2011-07-28 17:31 ` Jim Rees
  2011-07-28 17:31 ` [PATCH v4 26/27] pnfsblock: note written INVAL areas for layoutcommit Jim Rees
                   ` (2 subsequent siblings)
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:31 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Note: When the upper layer's read/write request cannot be fulfilled, the
block layout driver shouldn't silently mark the page as error. It should do
what can be done and leave the rest to the upper layer. To do so, we
should set rdata/wdata->res.count properly.

When the upper layer resends the read/write request to finish the rest
of the request, pgbase is the position where we should start.
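
For the write side, the two rules above (start at pgbase, report only what was
covered) can be sketched in userspace roughly as follows. This is not kernel
code: sketch_first_page and sketch_write_count are hypothetical names, and
4096-byte pages and 512-byte sectors are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT   12	/* assumption: 4096-byte pages */
#define SKETCH_SECTOR_SHIFT 9	/* assumption: 512-byte sectors */

/* Index of the first page to process when a request is resent with pgbase. */
static unsigned int sketch_first_page(unsigned int pgbase)
{
	return pgbase >> SKETCH_PAGE_SHIFT;
}

/*
 * Bytes to report as written.
 * isect  - first sector NOT covered by the submitted bios
 * offset - byte offset at which the request started
 * count  - bytes originally requested
 */
static uint64_t sketch_write_count(uint64_t isect, uint64_t offset,
				   uint64_t count)
{
	uint64_t end = isect << SKETCH_SECTOR_SHIFT;
	uint64_t done;

	/* The covered range must not end before the request starts. */
	assert(end >= offset);
	done = end - offset;

	/* Whole pages are submitted, so never report more than was asked. */
	return count < done ? count : done;
}
```

The clamp matters because the last page is written in full even when the
request ends in the middle of it.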

[pnfsblock: bl_write_pagelist support functions]
[pnfsblock: bl_write_pagelist adjust for missing PG_USE_PNFS]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: handle errors when read or write pagelist.]
Signed-off-by: Zhang Jingwang <yyalone@gmail.com>
[pnfs-block: use new write_pagelist api]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>

[SQUASHME: pnfsblock: mds_offset is set in the generic layer]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>

[pnfsblock: mark IO error with NFS_LAYOUT_{RW|RO}_FAILED]
Signed-off-by: Peng Tao <peng_tao@emc.com>
[pnfsblock: SQUASHME: adjust to API change]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: fixup blksize alignment in bl_setup_layoutcommit]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[pnfsblock: bl_write_pagelist adjust for missing PG_USE_PNFS]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: handle errors when read or write pagelist.]
Signed-off-by: Zhang Jingwang <yyalone@gmail.com>
[pnfs-block: use new write_pagelist api]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
---
 fs/nfs/blocklayout/blocklayout.c |  129 +++++++++++++++++++++++++++++++++++++-
 1 files changed, 126 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index aecd73b..c5ed569 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -74,6 +74,19 @@ static int is_hole(struct pnfs_block_extent *be, sector_t isect)
 		return !is_sector_initialized(be->be_inval, isect);
 }
 
+/* Given the be associated with isect, determine if page data can be
+ * written to disk.
+ */
+static int is_writable(struct pnfs_block_extent *be, sector_t isect)
+{
+	if (be->be_state == PNFS_BLOCK_READWRITE_DATA)
+		return 1;
+	else if (be->be_state != PNFS_BLOCK_INVALID_DATA)
+		return 0;
+	else
+		return is_sector_initialized(be->be_inval, isect);
+}
+
 /* The data we are handed might be spread across several bios.  We need
  * to track when the last one is finished.
  */
@@ -316,11 +329,121 @@ out:
 	return PNFS_NOT_ATTEMPTED;
 }
 
+/* This is basically copied from mpage_end_io_read */
+static void bl_end_io_write(struct bio *bio, int err)
+{
+	struct parallel_io *par = bio->bi_private;
+	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	struct nfs_write_data *wdata = (struct nfs_write_data *)par->data;
+
+	if (!uptodate) {
+		if (!wdata->pnfs_error)
+			wdata->pnfs_error = -EIO;
+		bl_set_lo_fail(wdata->lseg);
+	}
+	bio_put(bio);
+	put_parallel(par);
+}
+
+/* Function scheduled for call during bl_end_par_io_write,
+ * it marks sectors as written and extends the commitlist.
+ */
+static void bl_write_cleanup(struct work_struct *work)
+{
+	struct rpc_task *task;
+	struct nfs_write_data *wdata;
+	dprintk("%s enter\n", __func__);
+	task = container_of(work, struct rpc_task, u.tk_work);
+	wdata = container_of(task, struct nfs_write_data, task);
+	pnfs_ld_write_done(wdata);
+}
+
+/* Called when last of bios associated with a bl_write_pagelist call finishes */
+static void
+bl_end_par_io_write(void *data)
+{
+	struct nfs_write_data *wdata = data;
+
+	/* STUB - ignoring error handling */
+	wdata->task.tk_status = 0;
+	wdata->verf.committed = NFS_FILE_SYNC;
+	INIT_WORK(&wdata->task.u.tk_work, bl_write_cleanup);
+	schedule_work(&wdata->task.u.tk_work);
+}
+
 static enum pnfs_try_status
-bl_write_pagelist(struct nfs_write_data *wdata,
-		  int sync)
+bl_write_pagelist(struct nfs_write_data *wdata, int sync)
 {
-	return PNFS_NOT_ATTEMPTED;
+	int i;
+	struct bio *bio = NULL;
+	struct pnfs_block_extent *be = NULL;
+	sector_t isect, extent_length = 0;
+	struct parallel_io *par;
+	loff_t offset = wdata->args.offset;
+	size_t count = wdata->args.count;
+	struct page **pages = wdata->args.pages;
+	int pg_index = wdata->args.pgbase >> PAGE_CACHE_SHIFT;
+
+	dprintk("%s enter, %Zu@%lld\n", __func__, count, offset);
+	/* At this point, wdata->pages is a (sequential) list of nfs_pages.
+	 * We want to write each, and if there is an error remove it from
+	 * list and call
+	 * nfs_retry_request(req) to have it redone using nfs.
+	 * QUEST? Do as block or per req?  Think have to do per block
+	 * as part of end_bio
+	 */
+	par = alloc_parallel(wdata);
+	if (!par)
+		return PNFS_NOT_ATTEMPTED;
+	par->call_ops = *wdata->mds_ops;
+	par->call_ops.rpc_call_done = bl_rpc_do_nothing;
+	par->pnfs_callback = bl_end_par_io_write;
+	/* At this point, have to be more careful with error handling */
+
+	isect = (sector_t) ((offset & (long)PAGE_CACHE_MASK) >> SECTOR_SHIFT);
+	for (i = pg_index; i < wdata->npages ; i++) {
+		if (!extent_length) {
+			/* We've used up the previous extent */
+			bl_put_extent(be);
+			bio = bl_submit_bio(WRITE, bio);
+			/* Get the next one */
+			be = bl_find_get_extent(BLK_LSEG2EXT(wdata->lseg),
+					     isect, NULL);
+			if (!be || !is_writable(be, isect)) {
+				wdata->pnfs_error = -ENOMEM;
+				goto out;
+			}
+			extent_length = be->be_length -
+				(isect - be->be_f_offset);
+		}
+		for (;;) {
+			if (!bio) {
+				bio = bio_alloc(GFP_NOIO, wdata->npages - i);
+				if (!bio) {
+					wdata->pnfs_error = -ENOMEM;
+					goto out;
+				}
+				bio->bi_sector = isect - be->be_f_offset +
+					be->be_v_offset;
+				bio->bi_bdev = be->be_mdev;
+				bio->bi_end_io = bl_end_io_write;
+				bio->bi_private = par;
+			}
+			if (bio_add_page(bio, pages[i], PAGE_SIZE, 0))
+				break;
+			bio = bl_submit_bio(WRITE, bio);
+		}
+		isect += PAGE_CACHE_SECTORS;
+		extent_length -= PAGE_CACHE_SECTORS;
+	}
+	wdata->res.count = (isect << SECTOR_SHIFT) - (offset);
+	if (count < wdata->res.count)
+		wdata->res.count = count;
+out:
+	bl_put_extent(be);
+	bl_submit_bio(WRITE, bio);
+	put_parallel(par);
+	return PNFS_ATTEMPTED;
 }
 
 /* FIXME - range ignored */
-- 
1.7.4.1



* [PATCH v4 26/27] pnfsblock: note written INVAL areas for layoutcommit
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (24 preceding siblings ...)
  2011-07-28 17:31 ` [PATCH v4 25/27] pnfsblock: bl_write_pagelist Jim Rees
@ 2011-07-28 17:31 ` Jim Rees
  2011-07-28 17:31 ` [PATCH v4 27/27] pnfsblock: write_pagelist handle zero invalid extents Jim Rees
  2011-07-29 15:51 ` [PATCH v4 00/27] add block layout driver to pnfs client Christoph Hellwig
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:31 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Fred Isaman <iisaman@citi.umich.edu>

Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
---
 fs/nfs/blocklayout/blocklayout.c |   32 +++++++++++++
 fs/nfs/blocklayout/blocklayout.h |    2 +
 fs/nfs/blocklayout/extents.c     |   95 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 129 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index c5ed569..be7b9d2 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -329,6 +329,30 @@ out:
 	return PNFS_NOT_ATTEMPTED;
 }
 
+static void mark_extents_written(struct pnfs_block_layout *bl,
+				 __u64 offset, __u32 count)
+{
+	sector_t isect, end;
+	struct pnfs_block_extent *be;
+
+	dprintk("%s(%llu, %u)\n", __func__, offset, count);
+	if (count == 0)
+		return;
+	isect = (offset & (long)(PAGE_CACHE_MASK)) >> SECTOR_SHIFT;
+	end = (offset + count + PAGE_CACHE_SIZE - 1) & (long)(PAGE_CACHE_MASK);
+	end >>= SECTOR_SHIFT;
+	while (isect < end) {
+		sector_t len;
+		be = bl_find_get_extent(bl, isect, NULL);
+		BUG_ON(!be); /* FIXME */
+		len = min(end, be->be_f_offset + be->be_length) - isect;
+		if (be->be_state == PNFS_BLOCK_INVALID_DATA)
+			bl_mark_for_commit(be, isect, len); /* What if fails? */
+		isect += len;
+		bl_put_extent(be);
+	}
+}
+
 /* This is basically copied from mpage_end_io_read */
 static void bl_end_io_write(struct bio *bio, int err)
 {
@@ -355,6 +379,14 @@ static void bl_write_cleanup(struct work_struct *work)
 	dprintk("%s enter\n", __func__);
 	task = container_of(work, struct rpc_task, u.tk_work);
 	wdata = container_of(task, struct nfs_write_data, task);
+	if (!wdata->task.tk_status) {
+		/* Marks for LAYOUTCOMMIT */
+		/* BUG - this should be called after each bio, not after
+		 * all finish, unless have some way of storing success/failure
+		 */
+		mark_extents_written(BLK_LSEG2EXT(wdata->lseg),
+				     wdata->args.offset, wdata->args.count);
+	}
 	pnfs_ld_write_done(wdata);
 }
 
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 79f564d..d006d95 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -202,5 +202,7 @@ void clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
 				   int status);
 int bl_add_merge_extent(struct pnfs_block_layout *bl,
 			 struct pnfs_block_extent *new);
+int bl_mark_for_commit(struct pnfs_block_extent *be,
+			sector_t offset, sector_t length);
 
 #endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index a9224a1..c527365 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -217,6 +217,48 @@ int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect)
 	return rv;
 }
 
+/* Assume start, end already sector aligned */
+static int
+_range_has_tag(struct my_tree *tree, u64 start, u64 end, int32_t tag)
+{
+	struct pnfs_inval_tracking *pos;
+	u64 expect = 0;
+
+	dprintk("%s(%llu, %llu, %i) enter\n", __func__, start, end, tag);
+	list_for_each_entry_reverse(pos, &tree->mtt_stub, it_link) {
+		if (pos->it_sector >= end)
+			continue;
+		if (!expect) {
+			if ((pos->it_sector == end - tree->mtt_step_size) &&
+			    (pos->it_tags & (1 << tag))) {
+				expect = pos->it_sector - tree->mtt_step_size;
+				if (pos->it_sector < tree->mtt_step_size || expect < start)
+					return 1;
+				continue;
+			} else {
+				return 0;
+			}
+		}
+		if (pos->it_sector != expect || !(pos->it_tags & (1 << tag)))
+			return 0;
+		expect -= tree->mtt_step_size;
+		if (expect < start)
+			return 1;
+	}
+	return 0;
+}
+
+static int is_range_written(struct pnfs_inval_markings *marks,
+			    sector_t start, sector_t end)
+{
+	int rv;
+
+	spin_lock(&marks->im_lock);
+	rv = _range_has_tag(&marks->im_tree, start, end, EXTENT_WRITTEN);
+	spin_unlock(&marks->im_lock);
+	return rv;
+}
+
 /* Marks sectors in [offest, offset_length) as having been initialized.
  * All lengths are step-aligned, where step is min(pagesize, blocksize).
  * Notes where partial block is initialized, and helps prepare it for
@@ -396,6 +438,59 @@ static void add_to_commitlist(struct pnfs_block_layout *bl,
 	print_clist(clist, bl->bl_count);
 }
 
+/* Note the range described by offset, length is guaranteed to be contained
+ * within be.
+ */
+int bl_mark_for_commit(struct pnfs_block_extent *be,
+		    sector_t offset, sector_t length)
+{
+	sector_t new_end, end = offset + length;
+	struct pnfs_block_short_extent *new;
+	struct pnfs_block_layout *bl = container_of(be->be_inval,
+						    struct pnfs_block_layout,
+						    bl_inval);
+
+	new = kmalloc(sizeof(*new), GFP_NOFS);
+	if (!new)
+		return -ENOMEM;
+
+	mark_written_sectors(be->be_inval, offset, length);
+	/* We want to add the range to commit list, but it must be
+	 * block-normalized, and verified that the normalized range has
+	 * been entirely written to disk.
+	 */
+	new->bse_f_offset = offset;
+	offset = normalize(offset, bl->bl_blocksize);
+	if (offset < new->bse_f_offset) {
+		if (is_range_written(be->be_inval, offset, new->bse_f_offset))
+			new->bse_f_offset = offset;
+		else
+			new->bse_f_offset = offset + bl->bl_blocksize;
+	}
+	new_end = normalize_up(end, bl->bl_blocksize);
+	if (end < new_end) {
+		if (is_range_written(be->be_inval, end, new_end))
+			end = new_end;
+		else
+			end = new_end - bl->bl_blocksize;
+	}
+	if (end <= new->bse_f_offset) {
+		kfree(new);
+		return 0;
+	}
+	new->bse_length = end - new->bse_f_offset;
+	new->bse_devid = be->be_devid;
+	new->bse_mdev = be->be_mdev;
+
+	spin_lock(&bl->bl_ext_lock);
+	/* new will be freed, either by add_to_commitlist if it decides not
+	 * to use it, or after LAYOUTCOMMIT uses it in the commitlist.
+	 */
+	add_to_commitlist(bl, new);
+	spin_unlock(&bl->bl_ext_lock);
+	return 0;
+}
+
 static void print_bl_extent(struct pnfs_block_extent *be)
 {
 	dprintk("PRINT EXTENT extent %p\n", be);
-- 
1.7.4.1



* [PATCH v4 27/27] pnfsblock: write_pagelist handle zero invalid extents
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (25 preceding siblings ...)
  2011-07-28 17:31 ` [PATCH v4 26/27] pnfsblock: note written INVAL areas for layoutcommit Jim Rees
@ 2011-07-28 17:31 ` Jim Rees
  2011-07-29 15:51 ` [PATCH v4 00/27] add block layout driver to pnfs client Christoph Hellwig
  27 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-28 17:31 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, peter honeyman

From: Peng Tao <bergwolf@gmail.com>

For invalid extents, find other pages in the same fsblock and write them out.
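
To make "other pages in the same fsblock" concrete, here is a minimal
userspace sketch of the page counting, assuming 4096-byte pages, 512-byte
sectors, and a block size that is a multiple of the page size. The function
names are hypothetical; the arithmetic follows the npg_zero computations in
bl_write_pagelist in this patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT        12	/* assumption: 4096-byte pages */
#define SKETCH_SECTOR_SHIFT      9	/* assumption: 512-byte sectors */
#define SKETCH_PAGE_SECTOR_SHIFT (SKETCH_PAGE_SHIFT - SKETCH_SECTOR_SHIFT)

/*
 * Pages between the start of the fsblock and the first page of the write.
 * offset is the byte offset of the write; npg_per_block is pages per fsblock.
 */
static unsigned int sketch_head_pages(uint64_t offset,
				      unsigned int npg_per_block)
{
	assert(npg_per_block > 0);
	return (unsigned int)((offset >> SKETCH_PAGE_SHIFT) % npg_per_block);
}

/*
 * Pages between the end of the write and the end of its fsblock.
 * last_isect is the first sector after the last page written.
 */
static unsigned int sketch_tail_pages(uint64_t last_isect,
				      unsigned int npg_per_block)
{
	unsigned int rem;

	assert(npg_per_block > 0);
	rem = (unsigned int)((last_isect >> SKETCH_PAGE_SECTOR_SHIFT) %
			     npg_per_block);
	/* rem == 0 means the write already ended on a block boundary. */
	return rem ? npg_per_block - rem : 0;
}
```

With a 16K block (4 pages), a write starting at page 5 has one head page to
fill, and a write ending after page 6 has one tail page to fill.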

[pnfsblock: write_begin]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Peng Tao <peng_tao@emc.com>
---
 fs/nfs/blocklayout/blocklayout.c |  275 ++++++++++++++++++++++++++++++++------
 1 files changed, 233 insertions(+), 42 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index be7b9d2..81efa05 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -35,6 +35,7 @@
 #include <linux/mount.h>
 #include <linux/namei.h>
 #include <linux/bio.h>		/* struct bio */
+#include <linux/buffer_head.h>	/* various write calls */
 
 #include "blocklayout.h"
 
@@ -79,12 +80,8 @@ static int is_hole(struct pnfs_block_extent *be, sector_t isect)
  */
 static int is_writable(struct pnfs_block_extent *be, sector_t isect)
 {
-	if (be->be_state == PNFS_BLOCK_READWRITE_DATA)
-		return 1;
-	else if (be->be_state != PNFS_BLOCK_INVALID_DATA)
-		return 0;
-	else
-		return is_sector_initialized(be->be_inval, isect);
+	return (be->be_state == PNFS_BLOCK_READWRITE_DATA ||
+		be->be_state == PNFS_BLOCK_INVALID_DATA);
 }
 
 /* The data we are handed might be spread across several bios.  We need
@@ -353,6 +350,31 @@ static void mark_extents_written(struct pnfs_block_layout *bl,
 	}
 }
 
+static void bl_end_io_write_zero(struct bio *bio, int err)
+{
+	struct parallel_io *par = bio->bi_private;
+	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
+	struct nfs_write_data *wdata = (struct nfs_write_data *)par->data;
+
+	do {
+		struct page *page = bvec->bv_page;
+
+		if (--bvec >= bio->bi_io_vec)
+			prefetchw(&bvec->bv_page->flags);
+		/* This is the zeroing page we added */
+		end_page_writeback(page);
+		page_cache_release(page);
+	} while (bvec >= bio->bi_io_vec);
+	if (!uptodate) {
+		if (!wdata->pnfs_error)
+			wdata->pnfs_error = -EIO;
+		bl_set_lo_fail(wdata->lseg);
+	}
+	bio_put(bio);
+	put_parallel(par);
+}
+
 /* This is basically copied from mpage_end_io_read */
 static void bl_end_io_write(struct bio *bio, int err)
 {
@@ -379,11 +401,8 @@ static void bl_write_cleanup(struct work_struct *work)
 	dprintk("%s enter\n", __func__);
 	task = container_of(work, struct rpc_task, u.tk_work);
 	wdata = container_of(task, struct nfs_write_data, task);
-	if (!wdata->task.tk_status) {
+	if (!wdata->pnfs_error) {
 		/* Marks for LAYOUTCOMMIT */
-		/* BUG - this should be called after each bio, not after
-		 * all finish, unless have some way of storing success/failure
-		 */
 		mark_extents_written(BLK_LSEG2EXT(wdata->lseg),
 				     wdata->args.offset, wdata->args.count);
 	}
@@ -391,38 +410,110 @@ static void bl_write_cleanup(struct work_struct *work)
 }
 
 /* Called when last of bios associated with a bl_write_pagelist call finishes */
-static void
-bl_end_par_io_write(void *data)
+static void bl_end_par_io_write(void *data)
 {
 	struct nfs_write_data *wdata = data;
 
-	/* STUB - ignoring error handling */
 	wdata->task.tk_status = 0;
 	wdata->verf.committed = NFS_FILE_SYNC;
 	INIT_WORK(&wdata->task.u.tk_work, bl_write_cleanup);
 	schedule_work(&wdata->task.u.tk_work);
 }
 
+/* FIXME STUB - mark intersection of layout and page as bad, so is not
+ * used again.
+ */
+static void mark_bad_read(void)
+{
+	return;
+}
+
+/*
+ * map_block:  map a requested I/O block (isect) into an offset in the LVM
+ * block_device
+ */
+static void
+map_block(struct buffer_head *bh, sector_t isect, struct pnfs_block_extent *be)
+{
+	dprintk("%s enter be=%p\n", __func__, be);
+
+	set_buffer_mapped(bh);
+	bh->b_bdev = be->be_mdev;
+	bh->b_blocknr = (isect - be->be_f_offset + be->be_v_offset) >>
+	    (be->be_mdev->bd_inode->i_blkbits - SECTOR_SHIFT);
+
+	dprintk("%s isect %llu, bh->b_blocknr %ld, using bsize %Zd\n",
+		__func__, (unsigned long long)isect, (long)bh->b_blocknr,
+		bh->b_size);
+	return;
+}
+
+/* Given an unmapped page, zero it or read in page for COW, page is locked
+ * by caller.
+ */
+static int
+init_page_for_write(struct page *page, struct pnfs_block_extent *cow_read)
+{
+	struct buffer_head *bh = NULL;
+	int ret = 0;
+	sector_t isect;
+
+	dprintk("%s enter, %p\n", __func__, page);
+	BUG_ON(PageUptodate(page));
+	if (!cow_read) {
+		zero_user_segment(page, 0, PAGE_SIZE);
+		SetPageUptodate(page);
+		goto cleanup;
+	}
+
+	bh = alloc_page_buffers(page, PAGE_CACHE_SIZE, 0);
+	if (!bh) {
+		ret = -ENOMEM;
+		goto cleanup;
+	}
+
+	isect = (sector_t) page->index << PAGE_CACHE_SECTOR_SHIFT;
+	map_block(bh, isect, cow_read);
+	if (!bh_uptodate_or_lock(bh))
+		ret = bh_submit_read(bh);
+	if (ret)
+		goto cleanup;
+	SetPageUptodate(page);
+
+cleanup:
+	bl_put_extent(cow_read);
+	if (bh)
+		free_buffer_head(bh);
+	if (ret) {
+		/* Need to mark layout with bad read...should now
+		 * just use nfs4 for reads and writes.
+		 */
+		mark_bad_read();
+	}
+	return ret;
+}
+
 static enum pnfs_try_status
 bl_write_pagelist(struct nfs_write_data *wdata, int sync)
 {
-	int i;
+	int i, ret, npg_zero, pg_index, last = 0;
 	struct bio *bio = NULL;
-	struct pnfs_block_extent *be = NULL;
-	sector_t isect, extent_length = 0;
+	struct pnfs_block_extent *be = NULL, *cow_read = NULL;
+	sector_t isect, last_isect = 0, extent_length = 0;
 	struct parallel_io *par;
 	loff_t offset = wdata->args.offset;
 	size_t count = wdata->args.count;
 	struct page **pages = wdata->args.pages;
-	int pg_index = wdata->args.pgbase >> PAGE_CACHE_SHIFT;
+	struct page *page;
+	pgoff_t index;
+	u64 temp;
+	int npg_per_block =
+	    NFS_SERVER(wdata->inode)->pnfs_blksize >> PAGE_CACHE_SHIFT;
 
 	dprintk("%s enter, %Zu@%lld\n", __func__, count, offset);
 	/* At this point, wdata->pages is a (sequential) list of nfs_pages.
-	 * We want to write each, and if there is an error remove it from
-	 * list and call
-	 * nfs_retry_request(req) to have it redone using nfs.
-	 * QUEST? Do as block or per req?  Think have to do per block
-	 * as part of end_bio
+	 * We want to write each, and if there is an error set pnfs_error
+	 * to have it redone using nfs.
 	 */
 	par = alloc_parallel(wdata);
 	if (!par)
@@ -433,7 +524,91 @@ bl_write_pagelist(struct nfs_write_data *wdata, int sync)
 	/* At this point, have to be more careful with error handling */
 
 	isect = (sector_t) ((offset & (long)PAGE_CACHE_MASK) >> SECTOR_SHIFT);
-	for (i = pg_index; i < wdata->npages ; i++) {
+	be = bl_find_get_extent(BLK_LSEG2EXT(wdata->lseg), isect, &cow_read);
+	if (!be || !is_writable(be, isect)) {
+		dprintk("%s no matching extents!\n", __func__);
+		wdata->pnfs_error = -EINVAL;
+		goto out;
+	}
+
+	/* First page inside INVALID extent */
+	if (be->be_state == PNFS_BLOCK_INVALID_DATA) {
+		temp = offset >> PAGE_CACHE_SHIFT;
+		npg_zero = do_div(temp, npg_per_block);
+		isect = (sector_t) (((offset - npg_zero * PAGE_CACHE_SIZE) &
+				     (long)PAGE_CACHE_MASK) >> SECTOR_SHIFT);
+		extent_length = be->be_length - (isect - be->be_f_offset);
+
+fill_invalid_ext:
+		dprintk("%s need to zero %d pages\n", __func__, npg_zero);
+		for (;npg_zero > 0; npg_zero--) {
+			/* page ref released in bl_end_io_write_zero */
+			index = isect >> PAGE_CACHE_SECTOR_SHIFT;
+			dprintk("%s zero %dth page: index %lu isect %llu\n",
+				__func__, npg_zero, index,
+				(unsigned long long)isect);
+			page =
+			    find_or_create_page(wdata->inode->i_mapping, index,
+						GFP_NOFS);
+			if (!page) {
+				dprintk("%s oom\n", __func__);
+				wdata->pnfs_error = -ENOMEM;
+				goto out;
+			}
+
+			/* PageDirty: Other will write this out
+			 * PageWriteback: Other is writing this out
+			 * PageUptodate: It was read before
+			 * sector_initialized: already written out
+			 */
+			if (PageDirty(page) || PageWriteback(page) ||
+			    is_sector_initialized(be->be_inval, isect)) {
+				print_page(page);
+				unlock_page(page);
+				page_cache_release(page);
+				goto next_page;
+			}
+			if (!PageUptodate(page)) {
+				/* New page, readin or zero it */
+				init_page_for_write(page, cow_read);
+			}
+			set_page_writeback(page);
+			unlock_page(page);
+
+			ret = bl_mark_sectors_init(be->be_inval, isect,
+						       PAGE_CACHE_SECTORS,
+						       NULL);
+			if (unlikely(ret)) {
+				dprintk("%s bl_mark_sectors_init fail %d\n",
+					__func__, ret);
+				end_page_writeback(page);
+				page_cache_release(page);
+				wdata->pnfs_error = ret;
+				goto out;
+			}
+			bio = bl_add_page_to_bio(bio, npg_zero, WRITE,
+						 isect, page, be,
+						 bl_end_io_write_zero, par);
+			if (IS_ERR(bio)) {
+				wdata->pnfs_error = PTR_ERR(bio);
+				goto out;
+			}
+			/* FIXME: This should be done in bi_end_io */
+			mark_extents_written(BLK_LSEG2EXT(wdata->lseg),
+					     page->index << PAGE_CACHE_SHIFT,
+					     PAGE_CACHE_SIZE);
+next_page:
+			isect += PAGE_CACHE_SECTORS;
+			extent_length -= PAGE_CACHE_SECTORS;
+		}
+		if (last)
+			goto write_done;
+	}
+	bio = bl_submit_bio(WRITE, bio);
+
+	/* Middle pages */
+	pg_index = wdata->args.pgbase >> PAGE_CACHE_SHIFT;
+	for (i = pg_index; i < wdata->npages; i++) {
 		if (!extent_length) {
 			/* We've used up the previous extent */
 			bl_put_extent(be);
@@ -442,35 +617,51 @@ bl_write_pagelist(struct nfs_write_data *wdata, int sync)
 			be = bl_find_get_extent(BLK_LSEG2EXT(wdata->lseg),
 					     isect, NULL);
 			if (!be || !is_writable(be, isect)) {
-				wdata->pnfs_error = -ENOMEM;
+				wdata->pnfs_error = -EINVAL;
 				goto out;
 			}
 			extent_length = be->be_length -
-				(isect - be->be_f_offset);
+			    (isect - be->be_f_offset);
 		}
-		for (;;) {
-			if (!bio) {
-				bio = bio_alloc(GFP_NOIO, wdata->npages - i);
-				if (!bio) {
-					wdata->pnfs_error = -ENOMEM;
-					goto out;
-				}
-				bio->bi_sector = isect - be->be_f_offset +
-					be->be_v_offset;
-				bio->bi_bdev = be->be_mdev;
-				bio->bi_end_io = bl_end_io_write;
-				bio->bi_private = par;
+		if (be->be_state == PNFS_BLOCK_INVALID_DATA) {
+			ret = bl_mark_sectors_init(be->be_inval, isect,
+						       PAGE_CACHE_SECTORS,
+						       NULL);
+			if (unlikely(ret)) {
+				dprintk("%s bl_mark_sectors_init fail %d\n",
+					__func__, ret);
+				wdata->pnfs_error = ret;
+				goto out;
 			}
-			if (bio_add_page(bio, pages[i], PAGE_SIZE, 0))
-				break;
-			bio = bl_submit_bio(WRITE, bio);
+		}
+		bio = bl_add_page_to_bio(bio, wdata->npages - i, WRITE,
+					 isect, pages[i], be,
+					 bl_end_io_write, par);
+		if (IS_ERR(bio)) {
+			wdata->pnfs_error = PTR_ERR(bio);
+			goto out;
 		}
 		isect += PAGE_CACHE_SECTORS;
+		last_isect = isect;
 		extent_length -= PAGE_CACHE_SECTORS;
 	}
-	wdata->res.count = (isect << SECTOR_SHIFT) - (offset);
-	if (count < wdata->res.count)
+
+	/* Last page inside INVALID extent */
+	if (be->be_state == PNFS_BLOCK_INVALID_DATA) {
+		bio = bl_submit_bio(WRITE, bio);
+		temp = last_isect >> PAGE_CACHE_SECTOR_SHIFT;
+		npg_zero = npg_per_block - do_div(temp, npg_per_block);
+		if (npg_zero < npg_per_block) {
+			last = 1;
+			goto fill_invalid_ext;
+		}
+	}
+
+write_done:
+	wdata->res.count = (last_isect << SECTOR_SHIFT) - (offset);
+	if (count < wdata->res.count) {
 		wdata->res.count = count;
+	}
 out:
 	bl_put_extent(be);
 	bl_submit_bio(WRITE, bio);
-- 
1.7.4.1
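
The tail arithmetic in the "Last page inside INVALID extent" hunk above can be sketched in user space as follows. This is only an illustration under stated assumptions: 4 KiB pages and 512-byte sectors (so the page-to-sector shift is 3), a hypothetical helper name pages_to_zero(), and the plain % operator standing in for the kernel's do_div(), which divides in place and returns the remainder. Returning 0 for the block-aligned case is a convention of this sketch; the patch itself compares npg_zero against npg_per_block.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages and 512-byte sectors, i.e. 8 sectors per page. */
#define SKETCH_PAGE_SECTOR_SHIFT 3

/*
 * Hypothetical helper: last_isect is the first sector past the last page
 * written, npg_per_block the number of pages in one block.  Returns how
 * many pages are still needed to fill out the block, or 0 when the write
 * ended exactly on a block boundary.
 */
static unsigned int pages_to_zero(uint64_t last_isect,
				  unsigned int npg_per_block)
{
	uint64_t page_index;
	unsigned int rem, npg_zero;

	assert(npg_per_block > 0);
	page_index = last_isect >> SKETCH_PAGE_SECTOR_SHIFT;
	rem = (unsigned int)(page_index % npg_per_block);
	npg_zero = npg_per_block - rem;

	/* npg_zero == npg_per_block means nothing is left to fill. */
	return npg_zero < npg_per_block ? npg_zero : 0;
}
```

With four pages per block, a write ending after the first page of a block (last_isect = 8) leaves three pages to zero, while one ending on a block boundary (last_isect = 32) leaves none.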



* Re: [PATCH v4 09/27] pnfs: cleanup_layoutcommit
  2011-07-28 17:30 ` [PATCH v4 09/27] pnfs: cleanup_layoutcommit Jim Rees
@ 2011-07-28 18:26   ` Boaz Harrosh
  2011-07-29  3:16     ` Jim Rees
  0 siblings, 1 reply; 63+ messages in thread
From: Boaz Harrosh @ 2011-07-28 18:26 UTC (permalink / raw)
  To: Jim Rees; +Cc: Trond Myklebust, linux-nfs, peter honeyman

On 07/28/2011 10:30 AM, Jim Rees wrote:
> From: Andy Adamson <andros@netapp.com>
> 
> This gives layout driver a chance to cleanup structures they put in at
> encode_layoutcommit.
> 
> Signed-off-by: Andy Adamson <andros@netapp.com>
> [fixup layout header pointer for layoutcommit]
> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
> Signed-off-by: Benny Halevy <bhalevy@tonian.com>
> ---
>  fs/nfs/nfs4proc.c       |    1 +
>  fs/nfs/nfs4xdr.c        |    1 +
>  fs/nfs/pnfs.c           |   10 ++++++++++
>  fs/nfs/pnfs.h           |    5 +++++
>  include/linux/nfs_xdr.h |    1 +
>  5 files changed, 18 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index e86de79..6cb84b4 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -5963,6 +5963,7 @@ static void nfs4_layoutcommit_release(void *calldata)
>  	struct nfs4_layoutcommit_data *data = calldata;
>  	struct pnfs_layout_segment *lseg, *tmp;
>  
> +	pnfs_cleanup_layoutcommit(data->args.inode, data);

If inode is part of @data, which is also passed as an argument, then we can
simplify the API by just passing @data.

>  	/* Matched by references in pnfs_set_layoutcommit */
>  	list_for_each_entry_safe(lseg, tmp, &data->lseg_list, pls_lc_list) {
>  		list_del_init(&lseg->pls_lc_list);
> diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
> index 0261669..1dce12f 100644
> --- a/fs/nfs/nfs4xdr.c
> +++ b/fs/nfs/nfs4xdr.c
> @@ -5599,6 +5599,7 @@ static int decode_layoutcommit(struct xdr_stream *xdr,
>  	int status;
>  
>  	status = decode_op_hdr(xdr, OP_LAYOUTCOMMIT);
> +	res->status = status;
>  	if (status)
>  		return status;
>  
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index 3a47f7c..c1cc216 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -1411,6 +1411,16 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
>  }
>  EXPORT_SYMBOL_GPL(pnfs_set_layoutcommit);
>  
> +void pnfs_cleanup_layoutcommit(struct inode *inode,
> +			       struct nfs4_layoutcommit_data *data)
> +{
> +	struct nfs_server *nfss = NFS_SERVER(inode);
> +
> +	if (nfss->pnfs_curr_ld->cleanup_layoutcommit)
> +		nfss->pnfs_curr_ld->cleanup_layoutcommit(NFS_I(inode)->layout,
> +							 data);

Here too: since @data has the inode, the LD can do the
	NFS_I(data->args.inode)->layout

dereference itself, so only @data needs to be passed as an argument.

Boaz

> +}
> +
>  /*
>   * For the LAYOUT4_NFSV4_1_FILES layout type, NFS_DATA_SYNC WRITEs and
>   * NFS_UNSTABLE WRITEs with a COMMIT to data servers must store enough
> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> index bddd8b9..f271425 100644
> --- a/fs/nfs/pnfs.h
> +++ b/fs/nfs/pnfs.h
> @@ -113,6 +113,9 @@ struct pnfs_layoutdriver_type {
>  				     struct xdr_stream *xdr,
>  				     const struct nfs4_layoutreturn_args *args);
>  
> +	void (*cleanup_layoutcommit) (struct pnfs_layout_hdr *layoutid,
> +				      struct nfs4_layoutcommit_data *data);
> +
>  	void (*encode_layoutcommit) (struct pnfs_layout_hdr *layoutid,
>  				     struct xdr_stream *xdr,
>  				     const struct nfs4_layoutcommit_args *args);
> @@ -196,6 +199,8 @@ void pnfs_roc_release(struct inode *ino);
>  void pnfs_roc_set_barrier(struct inode *ino, u32 barrier);
>  bool pnfs_roc_drain(struct inode *ino, u32 *barrier);
>  void pnfs_set_layoutcommit(struct nfs_write_data *wdata);
> +void pnfs_cleanup_layoutcommit(struct inode *inode,
> +			       struct nfs4_layoutcommit_data *data);
>  int pnfs_layoutcommit_inode(struct inode *inode, bool sync);
>  int _pnfs_return_layout(struct inode *);
>  int pnfs_ld_write_done(struct nfs_write_data *);
> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
> index 94f27e5..569ea5b 100644
> --- a/include/linux/nfs_xdr.h
> +++ b/include/linux/nfs_xdr.h
> @@ -269,6 +269,7 @@ struct nfs4_layoutcommit_res {
>  	struct nfs_fattr *fattr;
>  	const struct nfs_server *server;
>  	struct nfs4_sequence_res seq_res;
> +	int status;
>  };
>  
>  struct nfs4_layoutcommit_data {
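
The single-argument form suggested above can be sketched in user space. This is only an illustration: the struct definitions are pared-down stand-ins for the kernel types (assumptions, not the real layouts), and demo_cleanup() and demo_run() are hypothetical names. The point is that a callback taking only @data can still reach the layout header through data->args.inode.

```c
#include <stddef.h>

/* Stand-in types: assumptions for illustration, not kernel definitions. */
struct layout_hdr { int cleaned; };
struct inode { struct layout_hdr *layout; };
struct layoutcommit_args { struct inode *inode; };
struct layoutcommit_data { struct layoutcommit_args args; int status; };

struct layoutdriver {
	/* Single-argument form: the driver derives everything from data. */
	void (*cleanup_layoutcommit)(struct layoutcommit_data *data);
};

/* Hypothetical driver callback: finds the layout header via the inode. */
static void demo_cleanup(struct layoutcommit_data *data)
{
	struct layout_hdr *lo = data->args.inode->layout;

	lo->cleaned = 1;
}

/* Generic wrapper: calls the driver hook only if the driver has one. */
static void cleanup_layoutcommit(const struct layoutdriver *ld,
				 struct layoutcommit_data *data)
{
	if (ld->cleanup_layoutcommit)
		ld->cleanup_layoutcommit(data);
}

/* Returns 1 if the hook ran and reached the layout header via data alone. */
static int demo_run(void)
{
	struct layout_hdr lo = { 0 };
	struct inode ino = { &lo };
	struct layoutcommit_data data = { { &ino }, 0 };
	struct layoutdriver with_hook = { demo_cleanup };
	struct layoutdriver no_hook = { NULL };

	cleanup_layoutcommit(&no_hook, &data);	/* no hook: nothing happens */
	if (lo.cleaned)
		return 0;
	cleanup_layoutcommit(&with_hook, &data);
	return lo.cleaned;
}
```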



* Re: [PATCH v4 05/27] pnfs: let layoutcommit handle a list of lseg
  2011-07-28 17:30 ` [PATCH v4 05/27] pnfs: let layoutcommit handle a list of lseg Jim Rees
@ 2011-07-28 18:52   ` Boaz Harrosh
  0 siblings, 0 replies; 63+ messages in thread
From: Boaz Harrosh @ 2011-07-28 18:52 UTC (permalink / raw)
  To: Jim Rees, Trond Myklebust, Peng Tao; +Cc: linux-nfs, peter honeyman

On 07/28/2011 10:30 AM, Jim Rees wrote:
> From: Peng Tao <bergwolf@gmail.com>
> 
> There can be multiple lseg per file, so layoutcommit should be
> able to handle it.
> 

Thanks Peng, Jim

Trond

I think this patch and the next one are the minimal set
that I need for Stable.

Let me test vanilla 3.0 and I'll get back to you.

Thanks
Boaz

> Signed-off-by: Peng Tao <peng_tao@emc.com>

> ---
>  fs/nfs/nfs4proc.c       |    8 +++++++-
>  fs/nfs/pnfs.c           |   34 +++++++++++++++++-----------------
>  fs/nfs/pnfs.h           |    2 ++
>  include/linux/nfs_xdr.h |    2 +-
>  4 files changed, 27 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index ebb6f1a..af32d3d 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -5960,9 +5960,15 @@ nfs4_layoutcommit_done(struct rpc_task *task, void *calldata)
>  static void nfs4_layoutcommit_release(void *calldata)
>  {
>  	struct nfs4_layoutcommit_data *data = calldata;
> +	struct pnfs_layout_segment *lseg, *tmp;
>  
>  	/* Matched by references in pnfs_set_layoutcommit */
> -	put_lseg(data->lseg);
> +	list_for_each_entry_safe(lseg, tmp, &data->lseg_list, pls_lc_list) {
> +		list_del_init(&lseg->pls_lc_list);
> +		if (test_and_clear_bit(NFS_LSEG_LAYOUTCOMMIT,
> +				       &lseg->pls_flags))
> +			put_lseg(lseg);
> +	}
>  	put_rpccred(data->cred);
>  	kfree(data);
>  }
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index 201165e..e2c1eb4 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -235,6 +235,7 @@ static void
>  init_lseg(struct pnfs_layout_hdr *lo, struct pnfs_layout_segment *lseg)
>  {
>  	INIT_LIST_HEAD(&lseg->pls_list);
> +	INIT_LIST_HEAD(&lseg->pls_lc_list);
>  	atomic_set(&lseg->pls_refcount, 1);
>  	smp_mb();
>  	set_bit(NFS_LSEG_VALID, &lseg->pls_flags);
> @@ -1361,16 +1362,17 @@ pnfs_generic_pg_readpages(struct nfs_pageio_descriptor *desc)
>  EXPORT_SYMBOL_GPL(pnfs_generic_pg_readpages);
>  
>  /*
> - * Currently there is only one (whole file) write lseg.
> + * There can be multiple RW segments.
>   */
> -static struct pnfs_layout_segment *pnfs_list_write_lseg(struct inode *inode)
> +static void pnfs_list_write_lseg(struct inode *inode, struct list_head *listp)
>  {
> -	struct pnfs_layout_segment *lseg, *rv = NULL;
> +	struct pnfs_layout_segment *lseg;
>  
> -	list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list)
> -		if (lseg->pls_range.iomode == IOMODE_RW)
> -			rv = lseg;
> -	return rv;
> +	list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list) {
> +		if (lseg->pls_range.iomode == IOMODE_RW &&
> +		    test_bit(NFS_LSEG_LAYOUTCOMMIT, &lseg->pls_flags))
> +			list_add(&lseg->pls_lc_list, listp);
> +	}
>  }
>  
>  void
> @@ -1382,14 +1384,16 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
>  
>  	spin_lock(&nfsi->vfs_inode.i_lock);
>  	if (!test_and_set_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) {
> -		/* references matched in nfs4_layoutcommit_release */
> -		get_lseg(wdata->lseg);
> +		mark_as_dirty = true;
>  		nfsi->layout->plh_lc_cred =
>  			get_rpccred(wdata->args.context->state->owner->so_cred);
> -		mark_as_dirty = true;
>  		dprintk("%s: Set layoutcommit for inode %lu ",
>  			__func__, wdata->inode->i_ino);
>  	}
> +	if (!test_and_set_bit(NFS_LSEG_LAYOUTCOMMIT, &wdata->lseg->pls_flags)) {
> +		/* references matched in nfs4_layoutcommit_release */
> +		get_lseg(wdata->lseg);
> +	}
>  	if (end_pos > nfsi->layout->plh_lwb)
>  		nfsi->layout->plh_lwb = end_pos;
>  	spin_unlock(&nfsi->vfs_inode.i_lock);
> @@ -1416,7 +1420,6 @@ pnfs_layoutcommit_inode(struct inode *inode, bool sync)
>  {
>  	struct nfs4_layoutcommit_data *data;
>  	struct nfs_inode *nfsi = NFS_I(inode);
> -	struct pnfs_layout_segment *lseg;
>  	struct rpc_cred *cred;
>  	loff_t end_pos;
>  	int status = 0;
> @@ -1434,17 +1437,15 @@ pnfs_layoutcommit_inode(struct inode *inode, bool sync)
>  		goto out;
>  	}
>  
> +	INIT_LIST_HEAD(&data->lseg_list);
>  	spin_lock(&inode->i_lock);
>  	if (!test_and_clear_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) {
>  		spin_unlock(&inode->i_lock);
>  		kfree(data);
>  		goto out;
>  	}
> -	/*
> -	 * Currently only one (whole file) write lseg which is referenced
> -	 * in pnfs_set_layoutcommit and will be found.
> -	 */
> -	lseg = pnfs_list_write_lseg(inode);
> +
> +	pnfs_list_write_lseg(inode, &data->lseg_list);
>  
>  	end_pos = nfsi->layout->plh_lwb;
>  	cred = nfsi->layout->plh_lc_cred;
> @@ -1456,7 +1457,6 @@ pnfs_layoutcommit_inode(struct inode *inode, bool sync)
>  	spin_unlock(&inode->i_lock);
>  
>  	data->args.inode = inode;
> -	data->lseg = lseg;
>  	data->cred = cred;
>  	nfs_fattr_init(&data->fattr);
>  	data->args.bitmask = NFS_SERVER(inode)->cache_consistency_bitmask;
> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> index ac86c36..bddd8b9 100644
> --- a/fs/nfs/pnfs.h
> +++ b/fs/nfs/pnfs.h
> @@ -36,10 +36,12 @@
>  enum {
>  	NFS_LSEG_VALID = 0,	/* cleared when lseg is recalled/returned */
>  	NFS_LSEG_ROC,		/* roc bit received from server */
> +	NFS_LSEG_LAYOUTCOMMIT,	/* layoutcommit bit set for layoutcommit */
>  };
>  
>  struct pnfs_layout_segment {
>  	struct list_head pls_list;
> +	struct list_head pls_lc_list;
>  	struct pnfs_layout_range pls_range;
>  	atomic_t pls_refcount;
>  	unsigned long pls_flags;
> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
> index a07b682..21f333e 100644
> --- a/include/linux/nfs_xdr.h
> +++ b/include/linux/nfs_xdr.h
> @@ -273,7 +273,7 @@ struct nfs4_layoutcommit_res {
>  struct nfs4_layoutcommit_data {
>  	struct rpc_task task;
>  	struct nfs_fattr fattr;
> -	struct pnfs_layout_segment *lseg;
> +	struct list_head lseg_list;
>  	struct rpc_cred *cred;
>  	struct nfs4_layoutcommit_args args;
>  	struct nfs4_layoutcommit_res res;
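
The bookkeeping this patch introduces (a per-segment flag, one reference taken when the flag is first set, and a list of flagged RW segments that is walked on release) can be sketched in user space. This is a single-threaded illustration with stand-in types and hypothetical names; the kernel code uses atomic bit operations and runs the marking and collection steps under inode->i_lock, none of which is modelled here.

```c
#include <stdbool.h>
#include <stddef.h>

enum { SKETCH_IOMODE_READ = 1, SKETCH_IOMODE_RW = 2 };

/* Stand-in for a layout segment; field names are illustrative only. */
struct seg {
	int iomode;
	bool lc_pending;	/* plays the role of NFS_LSEG_LAYOUTCOMMIT */
	int refcount;
	struct seg *next;	/* all segments of the file */
	struct seg *lc_next;	/* segments collected for one layoutcommit */
};

/* Mark a segment; take a reference only the first time it is marked. */
static void seg_set_layoutcommit(struct seg *s)
{
	if (!s->lc_pending) {
		s->lc_pending = true;
		s->refcount++;
	}
}

/* Collect every RW segment that has a pending layoutcommit. */
static struct seg *collect_write_segs(struct seg *all)
{
	struct seg *list = NULL;
	struct seg *s;

	for (s = all; s; s = s->next) {
		if (s->iomode == SKETCH_IOMODE_RW && s->lc_pending) {
			s->lc_next = list;
			list = s;
		}
	}
	return list;
}

/* Release: clear each flag, drop the matching reference, count segments. */
static int release_segs(struct seg *list)
{
	int n = 0;

	while (list) {
		struct seg *next = list->lc_next;

		list->lc_next = NULL;
		if (list->lc_pending) {
			list->lc_pending = false;
			list->refcount--;
		}
		n++;
		list = next;
	}
	return n;
}

/* Returns the number of segments committed, or -1 on a refcount mismatch. */
static int demo_commit(void)
{
	struct seg c = { SKETCH_IOMODE_READ, false, 1, NULL, NULL };
	struct seg b = { SKETCH_IOMODE_RW, false, 1, &c, NULL };
	struct seg a = { SKETCH_IOMODE_RW, false, 1, &b, NULL };
	int n;

	seg_set_layoutcommit(&a);
	seg_set_layoutcommit(&a);	/* second write: no extra reference */
	seg_set_layoutcommit(&b);
	if (a.refcount != 2 || b.refcount != 2)
		return -1;
	n = release_segs(collect_write_segs(&a));
	if (a.refcount != 1 || b.refcount != 1 || c.refcount != 1)
		return -1;
	return n;
}
```

Setting the flag and taking the reference together is what keeps the count balanced: however many writes hit a segment, exactly one reference is held until the release path clears the flag.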



* Re: [PATCH v4 09/27] pnfs: cleanup_layoutcommit
  2011-07-28 18:26   ` Boaz Harrosh
@ 2011-07-29  3:16     ` Jim Rees
  0 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-29  3:16 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: Trond Myklebust, linux-nfs, peter honeyman

Boaz Harrosh wrote:

  On 07/28/2011 10:30 AM, Jim Rees wrote:
  > From: Andy Adamson <andros@netapp.com>
  > 
  > This gives layout driver a chance to cleanup structures they put in at
  > encode_layoutcommit.
  > 
  > Signed-off-by: Andy Adamson <andros@netapp.com>
  > [fixup layout header pointer for layoutcommit]
  > Signed-off-by: Benny Halevy <bhalevy@panasas.com>
  > Signed-off-by: Benny Halevy <bhalevy@tonian.com>
  > ---
  >  fs/nfs/nfs4proc.c       |    1 +
  >  fs/nfs/nfs4xdr.c        |    1 +
  >  fs/nfs/pnfs.c           |   10 ++++++++++
  >  fs/nfs/pnfs.h           |    5 +++++
  >  include/linux/nfs_xdr.h |    1 +
  >  5 files changed, 18 insertions(+), 0 deletions(-)
  > 
  > diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
  > index e86de79..6cb84b4 100644
  > --- a/fs/nfs/nfs4proc.c
  > +++ b/fs/nfs/nfs4proc.c
  > @@ -5963,6 +5963,7 @@ static void nfs4_layoutcommit_release(void *calldata)
  >  	struct nfs4_layoutcommit_data *data = calldata;
  >  	struct pnfs_layout_segment *lseg, *tmp;
  >  
  > +	pnfs_cleanup_layoutcommit(data->args.inode, data);
  
  If inode is part of @data, which is also passed as argument then we can simplify
  the API by just passing @data

Thanks, I've applied the following patch, which will be squashed and
included in the next version of the patch set.  I will probably send that
tomorrow.

From b757a45f208f31f8eff5ec499f1e99895713a17c Mon Sep 17 00:00:00 2001
From: Jim Rees <rees@umich.edu>
Date: Thu, 28 Jul 2011 15:26:49 -0400
Subject: [PATCH] SQUASHME: rm inode and pnfs_layout_hdr args from cleanup_layoutcommit()

Signed-off-by: Jim Rees <rees@umich.edu>
---
 fs/nfs/blocklayout/blocklayout.c |    5 +++--
 fs/nfs/nfs4proc.c                |    2 +-
 fs/nfs/pnfs.c                    |    8 +++-----
 fs/nfs/pnfs.h                    |    6 ++----
 4 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 81efa05..7450309 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -771,9 +771,10 @@ bl_encode_layoutcommit(struct pnfs_layout_hdr *lo, struct xdr_stream *xdr,
 }
 
 static void
-bl_cleanup_layoutcommit(struct pnfs_layout_hdr *lo,
-			struct nfs4_layoutcommit_data *lcdata)
+bl_cleanup_layoutcommit(struct nfs4_layoutcommit_data *lcdata)
 {
+	struct pnfs_layout_hdr *lo = NFS_I(lcdata->args.inode)->layout;
+
 	dprintk("%s enter\n", __func__);
 	clean_pnfs_block_layoutupdate(BLK_LO2EXT(lo), &lcdata->args, lcdata->res.status);
 }
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 6cb84b4..8c77039 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -5963,7 +5963,7 @@ static void nfs4_layoutcommit_release(void *calldata)
 	struct nfs4_layoutcommit_data *data = calldata;
 	struct pnfs_layout_segment *lseg, *tmp;
 
-	pnfs_cleanup_layoutcommit(data->args.inode, data);
+	pnfs_cleanup_layoutcommit(data);
 	/* Matched by references in pnfs_set_layoutcommit */
 	list_for_each_entry_safe(lseg, tmp, &data->lseg_list, pls_lc_list) {
 		list_del_init(&lseg->pls_lc_list);
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index c1cc216..e550e88 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1411,14 +1411,12 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
 }
 EXPORT_SYMBOL_GPL(pnfs_set_layoutcommit);
 
-void pnfs_cleanup_layoutcommit(struct inode *inode,
-			       struct nfs4_layoutcommit_data *data)
+void pnfs_cleanup_layoutcommit(struct nfs4_layoutcommit_data *data)
 {
-	struct nfs_server *nfss = NFS_SERVER(inode);
+	struct nfs_server *nfss = NFS_SERVER(data->args.inode);
 
 	if (nfss->pnfs_curr_ld->cleanup_layoutcommit)
-		nfss->pnfs_curr_ld->cleanup_layoutcommit(NFS_I(inode)->layout,
-							 data);
+		nfss->pnfs_curr_ld->cleanup_layoutcommit(data);
 }
 
 /*
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 82dde37..e0b5d80 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -113,8 +113,7 @@ struct pnfs_layoutdriver_type {
 				     struct xdr_stream *xdr,
 				     const struct nfs4_layoutreturn_args *args);
 
-	void (*cleanup_layoutcommit) (struct pnfs_layout_hdr *layoutid,
-				      struct nfs4_layoutcommit_data *data);
+	void (*cleanup_layoutcommit) (struct nfs4_layoutcommit_data *data);
 
 	void (*encode_layoutcommit) (struct pnfs_layout_hdr *layoutid,
 				     struct xdr_stream *xdr,
@@ -198,8 +197,7 @@ void pnfs_roc_release(struct inode *ino);
 void pnfs_roc_set_barrier(struct inode *ino, u32 barrier);
 bool pnfs_roc_drain(struct inode *ino, u32 *barrier);
 void pnfs_set_layoutcommit(struct nfs_write_data *wdata);
-void pnfs_cleanup_layoutcommit(struct inode *inode,
-			       struct nfs4_layoutcommit_data *data);
+void pnfs_cleanup_layoutcommit(struct nfs4_layoutcommit_data *data);
 int pnfs_layoutcommit_inode(struct inode *inode, bool sync);
 int _pnfs_return_layout(struct inode *);
 int pnfs_ld_write_done(struct nfs_write_data *);
-- 
1.7.4.1



* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
                   ` (26 preceding siblings ...)
  2011-07-28 17:31 ` [PATCH v4 27/27] pnfsblock: write_pagelist handle zero invalid extents Jim Rees
@ 2011-07-29 15:51 ` Christoph Hellwig
  2011-07-29 17:45   ` Peng Tao
  2011-07-29 18:54   ` Jim Rees
  27 siblings, 2 replies; 63+ messages in thread
From: Christoph Hellwig @ 2011-07-29 15:51 UTC (permalink / raw)
  To: Jim Rees; +Cc: Trond Myklebust, linux-nfs, peter honeyman

How well is the I/O code tested?  It's a full reimplementation of
code full of nasty traps.  Did you run xfstests over it?  It supports
nfs, so pointing it to a pnfs share should probably just work.



* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-07-29 15:51 ` [PATCH v4 00/27] add block layout driver to pnfs client Christoph Hellwig
@ 2011-07-29 17:45   ` Peng Tao
  2011-07-29 18:44     ` Christoph Hellwig
  2011-07-29 18:54   ` Jim Rees
  1 sibling, 1 reply; 63+ messages in thread
From: Peng Tao @ 2011-07-29 17:45 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jim Rees, Trond Myklebust, linux-nfs, peter honeyman

Hi, Christoph,

On Fri, Jul 29, 2011 at 11:51 PM, Christoph Hellwig <hch@infradead.org> wrote:
> How well is the I/O code tested?  It's a full reimplementation of
> code full of nasty traps.  Did you run xfstests over it?  It supports
> nfs, so pointing it to a pnfs share should probably just work.
We have been testing the code with cthon04 for some time, and all cases
have passed since the earliest version.

I also just tried xfstests. It does not seem to support NFSv4 right now:
I had to modify common.rc to make "./check -nfs" runnable; otherwise it
failed with:
common.rc: Error: $TEST_DEV (10.244.82.74:/s4fs1/) is not a MOUNTED
nfs filesystem
I tried mounting TEST_DIR both with and without pnfs and got the same error
either way.

After fixing common.rc, "./check -nfs" runs, but it fails and stops at
case 088. Both pnfs block and NFSv4 fail at the same case. Has anyone
run xfstests over NFSv4 before? I'm wondering whether this is a
regression or whether case 088 is even valid for NFSv4.
088      - output mismatch (see 088.out.bad)
--- 088.out     2011-07-29 07:33:58.180218573 -0400
+++ 088.out.bad 2011-07-29 08:50:55.242319901 -0400
@@ -1,9 +1,2 @@
 QA output created by 088
-access(TEST_DIR/t_access, 0) returns 0
-access(TEST_DIR/t_access, R_OK) returns 0
-access(TEST_DIR/t_access, W_OK) returns 0
-access(TEST_DIR/t_access, X_OK) returns -1
-access(TEST_DIR/t_access, R_OK | W_OK) returns 0
-access(TEST_DIR/t_access, R_OK | X_OK) returns -1
-access(TEST_DIR/t_access, W_OK | X_OK) returns -1
-access(TEST_DIR/t_access, R_OK | W_OK | X_OK) returns -1
+fchown: Invalid argument

-- 
Thanks,
Tao


* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-07-29 17:45   ` Peng Tao
@ 2011-07-29 18:44     ` Christoph Hellwig
  0 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2011-07-29 18:44 UTC (permalink / raw)
  To: Peng Tao
  Cc: Christoph Hellwig, Jim Rees, Trond Myklebust, linux-nfs, peter honeyman

On Sat, Jul 30, 2011 at 01:45:29AM +0800, Peng Tao wrote:
> After fixing common.rc, "./check -nfs" can run but failed and stopped
> at case 088. Both pnfs block and NFSv4 failed at the same case. Did
> anyone run xfstests over NFSv4 before? I'm wondering whether it is a
> regression or if case 088 is valid for NFSv4.

I've only tested it on NFSv3, and can't remember 088 failing.



* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-07-29 15:51 ` [PATCH v4 00/27] add block layout driver to pnfs client Christoph Hellwig
  2011-07-29 17:45   ` Peng Tao
@ 2011-07-29 18:54   ` Jim Rees
  2011-07-29 19:01     ` Christoph Hellwig
  1 sibling, 1 reply; 63+ messages in thread
From: Jim Rees @ 2011-07-29 18:54 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Trond Myklebust, linux-nfs, peter honeyman

Christoph Hellwig wrote:

  How well is the I/O code tested?  It's a full reimplementation of
  code full of nasty traps.  Did you run xfstests over it?  It supports
  nfs, so pointing it to a pnfs share should probably just work.

The current version of the code has been tested with Connectathon and
iozone.  Previous versions have been tested with the above plus various
other test suites and everyday use like kernel builds.

xfstests does require a small patch to work with NFSv4, which I can supply
if anyone is interested.

I can't test the current code with xfstests because NFS 4.1 without pnfs
doesn't pass these tests.  Here is what I get (test with block layout is
similar but without the hung task):

rhcl1# ./check -nfs
FSTYP         -- nfs
PLATFORM      -- Linux/x86_64 rhcl1 3.0.0-blk

001      - output mismatch (see 001.out.bad)
--- 001.out     2011-07-29 12:11:34.057245055 -0400
+++ 001.out.bad 2011-07-29 14:41:36.697152750 -0400
@@ -1,9 +1,4 @@
 QA output created by 001
 cleanup
-setup ....................................
-iter 1 chain ... check ....................................
-iter 2 chain ... check ....................................
-iter 3 chain ... check ....................................
-iter 4 chain ... check ....................................
-iter 5 chain ... check ....................................
+001 not run: this test requires a valid host fs for $SCRATCH_DEV
 cleanup
002      [not run] this test requires a valid host fs for $SCRATCH_DEV
003      [not run] not suitable for this filesystem type: nfs
004      [not run] not suitable for this filesystem type: nfs
005      [not run] this test requires a valid host fs for $SCRATCH_DEV
006      [not run] this test requires a valid host fs for $SCRATCH_DEV
007      [not run] this test requires a valid host fs for $SCRATCH_DEV
008      [not run] not suitable for this filesystem type: nfs
009      [not run] not suitable for this filesystem type: nfs
010      [not run] dbtest was not built for this platform
011      [not run] this test requires a valid host fs for $SCRATCH_DEV
012      [not run] not suitable for this filesystem type: nfs
013      [not run] this test requires a valid host fs for $SCRATCH_DEV
014      [not run] this test requires a valid host fs for $SCRATCH_DEV
015      [not run] not suitable for this filesystem type: nfs
016      [not run] not suitable for this filesystem type: nfs
017      [not run] not suitable for this filesystem type: nfs
018      [not run] not suitable for this filesystem type: nfs
019      [not run] not suitable for this filesystem type: nfs
020      [not run] not suitable for this filesystem type: nfs
021      [not run] not suitable for this filesystem type: nfs
022      [not run] xfsdump not found
023      [not run] xfsdump not found
024      [not run] xfsdump not found
025      [not run] xfsdump not found
026      [not run] xfsdump not found
027      [not run] xfsdump not found
028      [not run] xfsdump not found
029      [not run] not suitable for this filesystem type: nfs
030      [not run] not suitable for this filesystem type: nfs
031      [not run] not suitable for this filesystem type: nfs
032      [not run] not suitable for this filesystem type: nfs
033      [not run] not suitable for this filesystem type: nfs
034      [not run] not suitable for this filesystem type: nfs
035      [not run] xfsdump not found
036      [not run] xfsdump not found
037      [not run] xfsdump not found
038      [not run] xfsdump not found
039      [not run] xfsdump not found
040      [not run] Can't run srcdiff without KWORKAREA set
041      [not run] not suitable for this filesystem type: nfs
042      [not run] not suitable for this filesystem type: nfs
043      [not run] xfsdump not found
044      [not run] not suitable for this filesystem type: nfs
045      [not run] not suitable for this filesystem type: nfs
046      [not run] xfsdump not found
047      [not run] xfsdump not found
048      [not run] not suitable for this filesystem type: nfs
049      [not run] not suitable for this filesystem type: nfs
050      [not run] not suitable for this filesystem type: nfs
051      [not run] not suitable for this filesystem type: nfs
052      [not run] not suitable for this filesystem type: nfs
053      [not run] this test requires a valid $SCRATCH_DEV
054      [not run] not suitable for this filesystem type: nfs
055      [not run] xfsdump not found
056      [not run] xfsdump not found
057      [not run] Place holder for IRIX test 057
058      [not run] Place holder for IRIX test 058
059      [not run] Place holder for IRIX test 059
060      [not run] Place holder for IRIX test 060
061      [not run] xfsdump not found
062      [not run] this test requires a valid $SCRATCH_DEV
063      [not run] xfsdump not found
064      [not run] xfsdump not found
065      [not run] xfsdump not found
066      [not run] xfsdump not found
067      [not run] not suitable for this filesystem type: nfs
068      [not run] not suitable for this filesystem type: nfs
069      [not run] this test requires a valid $SCRATCH_DEV
070      [not run] attrs not supported by this filesystem type: nfs
071      [not run] not suitable for this filesystem type: nfs
072      [not run] not suitable for this filesystem type: nfs
073      [not run] not suitable for this filesystem type: nfs
074      [not run] this test requires a valid host fs for $SCRATCH_DEV
075      [not run] this test requires a valid host fs for $SCRATCH_DEV
076      [not run] this test requires a valid $SCRATCH_DEV
077      [not run] attrs not supported by this filesystem type: nfs
078      [not run] not suitable for this filesystem type: nfs
079      [not run] not suitable for this filesystem type: nfs
080      [not run] not suitable for this filesystem type: nfs
081      [not run] not suitable for this filesystem type: nfs
082      [not run] not suitable for this filesystem type: nfs
083      [not run] not suitable for this filesystem type: nfs
084      [not run] not suitable for this filesystem type: nfs
085      [not run] not suitable for this filesystem type: nfs
086      [not run] not suitable for this filesystem type: nfs
087      [not run] not suitable for this filesystem type: nfs
088      - output mismatch (see 088.out.bad)
--- 088.out     2011-07-29 12:11:34.085247833 -0400
+++ 088.out.bad 2011-07-29 14:41:48.336307595 -0400
@@ -1,9 +1,2 @@
 QA output created by 088
-access(TEST_DIR/t_access, 0) returns 0
-access(TEST_DIR/t_access, R_OK) returns 0
-access(TEST_DIR/t_access, W_OK) returns 0
-access(TEST_DIR/t_access, X_OK) returns -1
-access(TEST_DIR/t_access, R_OK | W_OK) returns 0
-access(TEST_DIR/t_access, R_OK | X_OK) returns -1
-access(TEST_DIR/t_access, W_OK | X_OK) returns -1
-access(TEST_DIR/t_access, R_OK | W_OK | X_OK) returns -1
+fchown: Invalid argument
089
Message from syslogd@rhcl1 at Jul 29 14:42:05 ...
 kernel:------------[ cut here ]------------

Message from syslogd@rhcl1 at Jul 29 14:42:05 ...
 kernel:invalid opcode: 0000 [#1] SMP 

Message from syslogd@rhcl1 at Jul 29 14:42:05 ...
 kernel:Stack:

Message from syslogd@rhcl1 at Jul 29 14:42:05 ...
 kernel:Call Trace:

Message from syslogd@rhcl1 at Jul 29 14:42:05 ...
 kernel:Code: 48 89 e5 53 41 52 48 8b 9f a8 02 00 00 48 8d bb 88 01 00 00 e8 23 39 19 e1 8b 83 5c 02 00 00 ff c0 85 c0 89 83 5c 02 00 00 74 02 <0f> 0b 66 ff 83 88 01 00 00 41 59 5b 5d c3 55 48 89 e5 41 57 41 

INFO: task t_mtab:13810 blocked for more than 10 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
t_mtab          D 0000000000000000     0 13810  13684 0x00000080
 ffff880037b05c38 0000000000000086 ffff88007ae46dc8 ffff880000000000
 ffff88007b02ae00 ffff880037b05fd8 ffff880037b05fd8 0000000000012c40
 ffffffff81a0c020 ffff88007b02ae00 ffff880037b05c38 ffffffffa0281a2f
Call Trace:
 [<ffffffffa0281a2f>] ? __put_nfs_open_context+0x35/0xad [nfs]
 [<ffffffff8143b623>] __mutex_lock_common+0xfd/0x15e
 [<ffffffff8143b799>] __mutex_lock_slowpath+0x16/0x18
 [<ffffffff8143b737>] mutex_lock+0x1e/0x32
 [<ffffffff811082fd>] ? walk_component+0x36d/0x3b1
 [<ffffffff811d738c>] ima_file_check+0x53/0x119
 [<ffffffff81109c2d>] do_last+0x44d/0x57c
 [<ffffffff81108bf6>] ? path_init+0x196/0x29d
 [<ffffffff8110a6e3>] path_openat+0xca/0x30b
 [<ffffffff8109a316>] ? call_rcu_sched+0x10/0x12
 [<ffffffff8110a972>] do_filp_open+0x33/0x81
 [<ffffffff8143af5c>] ? _cond_resched+0x9/0x1d
 [<ffffffff81113920>] ? alloc_fd+0x6d/0x118
 [<ffffffff810fe69b>] do_sys_open+0x69/0xfb
 [<ffffffff810907f2>] ? audit_syscall_entry+0x140/0x16c
 [<ffffffff810fe748>] sys_open+0x1b/0x1d
 [<ffffffff81442592>] system_call_fastpath+0x16/0x1b
INFO: task t_mtab:13812 blocked for more than 10 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
t_mtab          D ffff88007b291c00     0 13812  13684 0x00000080
 ffff880037d97c88 0000000000000082 ffff880037f099d8 ffff88007ae46dc8
 ffff880037b81700 ffff880037d97fd8 ffff880037d97fd8 0000000000012c40
 ffff88007b02ae00 ffff880037b81700 ffff880037d97c58 ffffffffa01ffa64
Call Trace:
 [<ffffffffa01ffa64>] ? generic_lookup_cred+0x10/0x12 [sunrpc]
 [<ffffffff8143b623>] __mutex_lock_common+0xfd/0x15e
 [<ffffffff8143b799>] __mutex_lock_slowpath+0x16/0x18
 [<ffffffff8143b737>] mutex_lock+0x1e/0x32
 [<ffffffff81107a9c>] ? audit_inode+0x15/0x28
 [<ffffffff8110998f>] do_last+0x1af/0x57c
 [<ffffffff81108bf6>] ? path_init+0x196/0x29d
 [<ffffffff8110a6e3>] path_openat+0xca/0x30b
 [<ffffffffa02a545b>] ? __nfs4_close+0xfc/0x108 [nfs]
 [<ffffffff8110a972>] do_filp_open+0x33/0x81
 [<ffffffff8143af5c>] ? _cond_resched+0x9/0x1d
 [<ffffffff81113920>] ? alloc_fd+0x6d/0x118
 [<ffffffff810fe69b>] do_sys_open+0x69/0xfb
 [<ffffffff810907f2>] ? audit_syscall_entry+0x140/0x16c
 [<ffffffff810fe748>] sys_open+0x1b/0x1d
 [<ffffffff81442592>] system_call_fastpath+0x16/0x1b
------------[ cut here ]------------
kernel BUG at /home/rees/linux-pnfs/fs/nfs/callback_xdr.c:775!
invalid opcode: 0000 [#1] SMP 
CPU 0 
Modules linked in: blocklayoutdriver nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand powernow_k8 freq_table mperf be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb3i libcxgbi cxgb3 mdio ip6t_REJECT ib_iser rdma_cm ib_cm nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter iw_cm ib_sa ib_mad ip6_tables ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi serio_raw amd64_edac_mod pcspkr tg3 i2c_nforce2 i2c_core edac_core edac_mce_amd shpchp k8temp ipv6 autofs4 mptspi ata_generic mptscsih pata_acpi mptbase scsi_transport_spi sata_nv pata_amd [last unloaded: scsi_wait_scan]

Pid: 1503, comm: nfsv4.1-svc Not tainted 3.0.0-blk #35 HP ProLiant DL145 G2/K85NL
RIP: 0010:[<ffffffffa02a89c2>]  [<ffffffffa02a89c2>] nfs4_cb_take_slot+0x2c/0x3a [nfs]
RSP: 0018:ffff880037f89c00  EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff88006b7ffc00 RCX: 0000000000000001
RDX: 0000000000000004 RSI: ffff88006b48da40 RDI: ffff88006b7ffd88
RBP: ffff880037f89c10 R08: 0000000000000000 R09: ffff880071a01db8
R10: ffff880071a01ca8 R11: ffff880037f89e40 R12: ffff88006bbb9800
R13: 0000000000000000 R14: ffff88006b716800 R15: ffff88006b7fec00
FS:  00007f78a8d7d720(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003ff9064c60 CR3: 000000006b341000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process nfsv4.1-svc (pid: 1503, threadinfo ffff880037f88000, task ffff88007b060000)
Stack:
 ffff880071a01ca8 ffff88006b716800 ffff880037f89ca0 ffffffffa02a942f
 ffff88006b716800 ffff880071f33098 ffff880037f89ca0 ffff88006b48da40
 0000000137f89c50 ffff88006b716808 ffff88006b7ffc58 ffff880037f89d60
Call Trace:
 [<ffffffffa02a942f>] nfs4_callback_sequence+0x264/0x32c [nfs]
 [<ffffffffa02a80f9>] nfs4_callback_compound+0x36a/0x4e5 [nfs]
 [<ffffffffa01fff27>] svc_process_common+0x253/0x4d0 [sunrpc]
 [<ffffffffa0200279>] bc_svc_process+0xd5/0xfe [sunrpc]
 [<ffffffffa02a74b3>] nfs41_callback_svc+0xd5/0x126 [nfs]
 [<ffffffff81063928>] ? remove_wait_queue+0x35/0x35
 [<ffffffffa02a73de>] ? param_set_portnr+0x47/0x47 [nfs]
 [<ffffffff8106328a>] kthread+0x7f/0x87
 [<ffffffff81444714>] kernel_thread_helper+0x4/0x10
 [<ffffffff8106320b>] ? kthread_worker_fn+0x143/0x143
 [<ffffffff81444710>] ? gs_change+0x13/0x13
Code: 48 89 e5 53 41 52 48 8b 9f a8 02 00 00 48 8d bb 88 01 00 00 e8 23 39 19 e1 8b 83 5c 02 00 00 ff c0 85 c0 89 83 5c 02 00 00 74 02 <0f> 0b 66 ff 83 88 01 00 00 41 59 5b 5d c3 55 48 89 e5 41 57 41 
RIP  [<ffffffffa02a89c2>] nfs4_cb_take_slot+0x2c/0x3a [nfs]
 RSP <ffff880037f89c00>
---[ end trace 76c6d9f5d46ae22e ]---
Callback slot table overflowed

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-07-29 18:54   ` Jim Rees
@ 2011-07-29 19:01     ` Christoph Hellwig
  2011-07-29 19:13       ` Jim Rees
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2011-07-29 19:01 UTC (permalink / raw)
  To: Jim Rees; +Cc: Christoph Hellwig, Trond Myklebust, linux-nfs, peter honeyman

On Fri, Jul 29, 2011 at 02:54:15PM -0400, Jim Rees wrote:
> xfstests does require a small patch to work with NFSv4, which I can supply
> if anyone is interested.

Please send it to xfs@oss.sgi.com with a proper description and signoff.

> PLATFORM      -- Linux/x86_64 rhcl1 3.0.0-blk
> 
> 001      - output mismatch (see 001.out.bad)
> -iter 5 chain ... check ....................................
> +001 not run: this test requires a valid host fs for $SCRATCH_DEV
>  cleanup
> 002      [not run] this test requires a valid host fs for $SCRATCH_DEV
> 003      [not run] not suitable for this filesystem type: nfs
> 004      [not run] not suitable for this filesystem type: nfs
> 005      [not run] this test requires a valid host fs for $SCRATCH_DEV
> 006      [not run] this test requires a valid host fs for $SCRATCH_DEV
> 007      [not run] this test requires a valid host fs for $SCRATCH_DEV

It seems like you didn't set up the SCRATCH_DEV variable properly.

> Message from syslogd@rhcl1 at Jul 29 14:42:05 ...
>  kernel:------------[ cut here ]------------
> 
> Message from syslogd@rhcl1 at Jul 29 14:42:05 ...
>  kernel:invalid opcode: 0000 [#1] SMP 
> 
> Message from syslogd@rhcl1 at Jul 29 14:42:05 ...
>  kernel:Stack:
> 
> Message from syslogd@rhcl1 at Jul 29 14:42:05 ...
>  kernel:Call Trace:
> 
> Message from syslogd@rhcl1 at Jul 29 14:42:05 ...

Looks like we did find a bug in NFS.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-07-29 19:01     ` Christoph Hellwig
@ 2011-07-29 19:13       ` Jim Rees
  2011-07-30  1:09         ` Trond Myklebust
  0 siblings, 1 reply; 63+ messages in thread
From: Jim Rees @ 2011-07-29 19:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Trond Myklebust, linux-nfs, peter honeyman

Christoph Hellwig wrote:

  On Fri, Jul 29, 2011 at 02:54:15PM -0400, Jim Rees wrote:
  > xfstests does require a small patch to work with NFSv4, which I can supply
  > if anyone is interested.
  
  Please send it to xfs@oss.sgi.com with a proper description and signoff.

I don't have a proper patch, just one that works for me.  But I'll send a
bug report.  It has to do with the mismatch between nfs and nfs4 mount
types, so it's not really a xfstests bug.  I think this will fix itself when
the nfsvers=4 changes fully propagate.

  It seems like you didn't set up the SCRATCH_DEV variable properly.

I was just skipping those tests so I could get to the one that fails.  I've
also tested with the SCRATCH_DEV tests in other runs.

  > Message from syslogd@rhcl1 at Jul 29 14:42:05 ...
  >  kernel:------------[ cut here ]------------
  > 
  > Message from syslogd@rhcl1 at Jul 29 14:42:05 ...
  >  kernel:invalid opcode: 0000 [#1] SMP 
  > 
  > Message from syslogd@rhcl1 at Jul 29 14:42:05 ...
  >  kernel:Stack:
  > 
  > Message from syslogd@rhcl1 at Jul 29 14:42:05 ...
  >  kernel:Call Trace:
  > 
  > Message from syslogd@rhcl1 at Jul 29 14:42:05 ...
  
  Looks like we did find a bug in NFS.

It kind of looks that way.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-07-29 19:13       ` Jim Rees
@ 2011-07-30  1:09         ` Trond Myklebust
  2011-07-30  3:26           ` Jim Rees
  2011-07-30 14:18           ` Jim Rees
  0 siblings, 2 replies; 63+ messages in thread
From: Trond Myklebust @ 2011-07-30  1:09 UTC (permalink / raw)
  To: Jim Rees; +Cc: Christoph Hellwig, linux-nfs, peter honeyman

On Fri, 2011-07-29 at 15:13 -0400, Jim Rees wrote: 
> Christoph Hellwig wrote:
> 
>   On Fri, Jul 29, 2011 at 02:54:15PM -0400, Jim Rees wrote:
>   > xfstests does require a small patch to work with NFSv4, which I can supply
>   > if anyone is interested.
>   
>   Please send it to xfs@oss.sgi.com with a proper description and signoff.
> 
> I don't have a proper patch, just one that works for me.  But I'll send a
> bug report.  It has to do with the mismatch between nfs and nfs4 mount
> types, so it's not really a xfstests bug.  I think this will fix itself when
> the nfsvers=4 changes fully propagate.
> 
>   It seems like you didn't set up the SCRATCH_DEV variable properly.
> 
> I was just skipping those tests so I could get to the one that fails.  I've
> also tested with the SCRATCH_DEV tests in other runs.
> 
>   > Message from syslogd@rhcl1 at Jul 29 14:42:05 ...
>   >  kernel:------------[ cut here ]------------
>   > 
>   > Message from syslogd@rhcl1 at Jul 29 14:42:05 ...
>   >  kernel:invalid opcode: 0000 [#1] SMP 
>   > 
>   > Message from syslogd@rhcl1 at Jul 29 14:42:05 ...
>   >  kernel:Stack:
>   > 
>   > Message from syslogd@rhcl1 at Jul 29 14:42:05 ...
>   >  kernel:Call Trace:
>   > 
>   > Message from syslogd@rhcl1 at Jul 29 14:42:05 ...
>   
>   Looks like we did find a bug in NFS.
> 
> It kind of looks that way.

Is that reproducible on the upstream kernel, or is it something that is
being introduced by the pNFS blocks code?

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-07-30  1:09         ` Trond Myklebust
@ 2011-07-30  3:26           ` Jim Rees
  2011-07-30 14:25             ` Peng Tao
  2011-07-30 14:18           ` Jim Rees
  1 sibling, 1 reply; 63+ messages in thread
From: Jim Rees @ 2011-07-30  3:26 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Christoph Hellwig, linux-nfs, peter honeyman

Trond Myklebust wrote:

  >   Looks like we did find a bug in NFS.
  > 
  > It kind of looks that way.
  
  Is that reproducible on the upstream kernel, or is it something that is
  being introduced by the pNFS blocks code?

It happens without the blocks module loaded, but it could be from something
we did outside the module.  I will test this weekend when I get a chance.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-07-30  1:09         ` Trond Myklebust
  2011-07-30  3:26           ` Jim Rees
@ 2011-07-30 14:18           ` Jim Rees
  1 sibling, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-07-30 14:18 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Christoph Hellwig, linux-nfs, peter honeyman

Trond Myklebust wrote:

  Is that reproducible on the upstream kernel, or is it something that is
  being introduced by the pNFS blocks code?

Upstream kernel 3.0.0-next-20110729 fails in a similar way, so it's not
anything introduced by the block layout code.

kernel BUG at /home/rees/linux-next/fs/nfs/callback_xdr.c:775!
invalid opcode: 0000 [#1] SMP 
CPU 0 
Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand powernow_k8 freq_table mperf be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb3i ip6t_REJECT libcxgbi cxgb3 nf_conntrack_ipv6 nf_defrag_ipv6 mdio ib_iser ip6table_filter ip6_tables rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi amd64_edac_mod i2c_nforce2 shpchp tg3 pcspkr edac_core i2c_core serio_raw k8temp edac_mce_amd ipv6 autofs4 ata_generic pata_acpi mptspi mptscsih mptbase scsi_transport_spi sata_nv pata_amd [last unloaded: scsi_wait_scan]

Pid: 6494, comm: nfsv4.1-svc Tainted: G        W   3.0.0-next-20110729 #2 HP ProLiant DL145 G2/K85NL
RIP: 0010:[<ffffffffa02998df>]  [<ffffffffa02998df>] nfs4_cb_take_slot+0x2e/0x3e [nfs]
RSP: 0018:ffff880074275bf0  EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff88007b67ec00 RCX: 0000000000000001
RDX: 000000000000001c RSI: ffff88006b8b49f0 RDI: ffff88007b67ed88
RBP: ffff880074275c00 R08: 00000000000000d0 R09: 0000000000000002
R10: ffff88007fc12e70 R11: ffffffff81b42ab0 R12: ffff880037e72000
R13: 0000000000000000 R14: ffff880037ed3800 R15: ffff88007b67e800
FS:  00007fb17e8de720(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003ff9064c60 CR3: 0000000069a99000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process nfsv4.1-svc (pid: 6494, threadinfo ffff880074274000, task ffff880069410000)
Stack:
 ffff880037e72000 ffff880037ed3800 ffff880074275ca0 ffffffffa029a5c2
 ffff88006b8b49e0 ffff880071786098 ffff880074275ca0 ffffffffa0299a1a
 ffff88006b8b49e0 0000000100000246 ffff88007b67ec48 ffff880037ed3808
Call Trace:
 [<ffffffffa029a5c2>] nfs4_callback_sequence+0x272/0x338 [nfs]
 [<ffffffffa0299a1a>] ? decode_cb_sequence_args+0x12b/0x24a [nfs]
 [<ffffffffa0299733>] nfs4_callback_compound+0x364/0x4e2 [nfs]
 [<ffffffff8106f3bf>] ? groups_alloc+0x38/0xbe
 [<ffffffffa01f7206>] svc_process_common+0x260/0x4d3 [sunrpc]
 [<ffffffffa01f7552>] bc_svc_process+0xd9/0x102 [sunrpc]
 [<ffffffffa029884c>] nfs41_callback_svc+0xd5/0x126 [nfs]
 [<ffffffff8106905f>] ? wake_up_bit+0x25/0x25
 [<ffffffffa0298777>] ? nfs_callback_down+0x7c/0x7c [nfs]
 [<ffffffff81068bf8>] kthread+0x7d/0x85
 [<ffffffff81467e14>] kernel_thread_helper+0x4/0x10
 [<ffffffff81068b7b>] ? kthread_worker_fn+0x147/0x147
 [<ffffffff81467e10>] ? gs_change+0x13/0x13
Code: e5 53 48 83 ec 08 48 8b 9f a8 02 00 00 48 8d bb 88 01 00 00 e8 c4 5b 1c e1 8b 83 5c 02 00 00 ff c0 85 c0 89 83 5c 02 00 00 74 04 <0f> 0b eb fe 66 ff 83 88 01 00 00 41 5b 5b c9 c3 55 48 8d 42 08 
RIP  [<ffffffffa02998df>] nfs4_cb_take_slot+0x2e/0x3e [nfs]
 RSP <ffff880074275bf0>
---[ end trace 27e49b345894527a ]---
Callback slot table overflowed

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-07-30  3:26           ` Jim Rees
@ 2011-07-30 14:25             ` Peng Tao
  2011-08-01 21:10               ` Trond Myklebust
  0 siblings, 1 reply; 63+ messages in thread
From: Peng Tao @ 2011-07-30 14:25 UTC (permalink / raw)
  To: Jim Rees, Trond Myklebust; +Cc: Christoph Hellwig, linux-nfs, peter honeyman

On Sat, Jul 30, 2011 at 11:26 AM, Jim Rees <rees@umich.edu> wrote:
> Trond Myklebust wrote:
>
>  >   Looks like we did find a bug in NFS.
>  >
>  > It kind of looks that way.
>
>  Is that reproducible on the upstream kernel, or is it something that is
>  being introduced by the pNFS blocks code?
>
> It happens without the blocks module loaded, but it could be from something
> we did outside the module.  I will test this weekend when I get a chance.
I tried xfstests again and was able to reproduce a hang on both block
layout and file layout (upstream commit ed1e62, w/o block layout
code). It seems it is a bug in pnfs code. I did not see it w/ NFSv4.
For pnfs block and file layout, it can be reproduced by just running
xfstests with ./check -nfs. It does not show up every time but is
likely to happen in less than 10 runs.  Not sure if it is the same one
Jim reported though.

block layout trace:
[  660.039009] BUG: soft lockup - CPU#1 stuck for 22s! [10.244.82.74-ma:29730]
[  660.039014] Modules linked in: blocklayoutdriver nfs lockd fscache
auth_rpcgss nfs_acl ebtable_nat ebtables ipt_MASQUERADE iptable_nat
nf_nat xt_CHECKSUM iptable_mangle bridge stp llc sunrpc be2iscsi
ip6t_REJECT iscsi_boot_sysfs nf_conntrack_ipv6 nf_defrag_ipv6 bnx2i
ip6table_filter cnic uio ip6_tables cxgb3i libcxgbi cxgb3 mdio
iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev i2c_piix4
i2c_core pcspkr e1000 parport_pc microcode parport vmw_balloon shpchp
ipv6 floppy mptspi mptscsih mptbase scsi_transport_spi [last unloaded:
nfs]
[  660.039014] CPU 1
[  660.039014] Modules linked in: blocklayoutdriver nfs lockd fscache
auth_rpcgss nfs_acl ebtable_nat ebtables ipt_MASQUERADE iptable_nat
nf_nat xt_CHECKSUM iptable_mangle bridge stp llc sunrpc be2iscsi
ip6t_REJECT iscsi_boot_sysfs nf_conntrack_ipv6 nf_defrag_ipv6 bnx2i
ip6table_filter cnic uio ip6_tables cxgb3i libcxgbi cxgb3 mdio
iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev i2c_piix4
i2c_core pcspkr e1000 parport_pc microcode parport vmw_balloon shpchp
ipv6 floppy mptspi mptscsih mptbase scsi_transport_spi [last unloaded:
nfs]
[  660.039014]
[  660.039014] Pid: 29730, comm: 10.244.82.74-ma Tainted: G      D
3.0.0-pnfs+ #2 VMware, Inc. VMware Virtual Platform/440BX Desktop
Reference Platform
[  660.039014] RIP: 0010:[<ffffffff81084f49>]  [<ffffffff81084f49>]
do_raw_spin_lock+0x1e/0x25
[  660.039014] RSP: 0018:ffff88001fef5e60  EFLAGS: 00000297
[  660.039014] RAX: 000000000000002b RBX: ffff88003be19000 RCX: 0000000000000001
[  660.039014] RDX: 000000000000002a RSI: ffff8800219a7cf0 RDI: ffff880020e4d988
[  660.039014] RBP: ffff88001fef5e60 R08: 0000000000000000 R09: 000000000000df20
[  660.039014] R10: 0000000000000000 R11: ffff8800219a7c00 R12: ffff88001fef5df0
[  660.039014] R13: 00000000c355df1b R14: ffff88003bfaeac0 R15: ffff8800219a7c00
[  660.039014] FS:  0000000000000000(0000) GS:ffff88003fd00000(0000)
knlGS:0000000000000000
[  660.039014] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  660.039014] CR2: 00007fc6122a4000 CR3: 0000000001a04000 CR4: 00000000000006e0
[  660.039014] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  660.039014] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  660.039014] Process 10.244.82.74-ma (pid: 29730, threadinfo
ffff88001fef4000, task ffff88001fca8000)
[  660.039014] Stack:
[  660.039014]  ffff88001fef5e70 ffffffff814585ee ffff88001fef5e90
ffffffffa02badee
[  660.039014]  0000000000000000 ffff8800219a7c00 ffff88001fef5ee0
ffffffffa02bc2d9
[  660.039014]  ffff880000000000 ffffffffa02d2250 ffff88001fef5ee0
ffff88002059ba10
[  660.039014] Call Trace:
[  660.039014]  [<ffffffff814585ee>] _raw_spin_lock+0xe/0x10
[  660.039014]  [<ffffffffa02badee>] nfs4_begin_drain_session+0x24/0x8f [nfs]
[  660.039014]  [<ffffffffa02bc2d9>] nfs4_run_state_manager+0x271/0x517 [nfs]
[  660.039014]  [<ffffffffa02bc068>] ? nfs4_do_reclaim+0x422/0x422 [nfs]
[  660.039014]  [<ffffffff810719bf>] kthread+0x84/0x8c
[  660.039014]  [<ffffffff81460f54>] kernel_thread_helper+0x4/0x10
[  660.039014]  [<ffffffff8107193b>] ? kthread_worker_fn+0x148/0x148
[  660.039014]  [<ffffffff81460f50>] ? gs_change+0x13/0x13
[  660.039014] Code: 00 00 10 00 74 05 e8 a7 59 1b 00 5d c3 55 48 89
e5 66 66 66 66 90 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2
74 07 f3 90 <0f> b7 17 eb f5 5d c3 55 48 89 e5 66 66 66 66 90 8b 07 89
c2 c1
[  660.039014] Call Trace:
[  660.039014]  [<ffffffff814585ee>] _raw_spin_lock+0xe/0x10
[  660.039014]  [<ffffffffa02badee>] nfs4_begin_drain_session+0x24/0x8f [nfs]
[  660.039014]  [<ffffffffa02bc2d9>] nfs4_run_state_manager+0x271/0x517 [nfs]
[  660.039014]  [<ffffffffa02bc068>] ? nfs4_do_reclaim+0x422/0x422 [nfs]
[  660.039014]  [<ffffffff810719bf>] kthread+0x84/0x8c
[  660.039014]  [<ffffffff81460f54>] kernel_thread_helper+0x4/0x10
[  660.039014]  [<ffffffff8107193b>] ? kthread_worker_fn+0x148/0x148
[  660.039014]  [<ffffffff81460f50>] ? gs_change+0x13/0x13


file layout trace:
[19716.049009] BUG: soft lockup - CPU#1 stuck for 23s! [10.244.82.76-ma:29036]
[19716.049011] Modules linked in: nfs_layout_nfsv41_files nfs lockd
fscache auth_rpcgss nfs_acl ebtable_nat ebtables ipt_MASQUERADE
iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge stp llc sunrpc
ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 be2iscsi iscsi_boot_sysfs
bnx2i cnic ip6table_filter uio ip6_tables cxgb3i libcxgbi cxgb3 mdio
iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev microcode
i2c_piix4 e1000 vmw_balloon parport_pc parport shpchp pcspkr i2c_core
ipv6 mptspi mptscsih mptbase scsi_transport_spi floppy [last unloaded:
nfs]
[19716.049011] CPU 1
[19716.049011] Modules linked in: nfs_layout_nfsv41_files nfs lockd
fscache auth_rpcgss nfs_acl ebtable_nat ebtables ipt_MASQUERADE
iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge stp llc sunrpc
ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 be2iscsi iscsi_boot_sysfs
bnx2i cnic ip6table_filter uio ip6_tables cxgb3i libcxgbi cxgb3 mdio
iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev microcode
i2c_piix4 e1000 vmw_balloon parport_pc parport shpchp pcspkr i2c_core
ipv6 mptspi mptscsih mptbase scsi_transport_spi floppy [last unloaded:
nfs]
[19716.049011]
[19716.049011] Pid: 29036, comm: 10.244.82.76-ma Tainted: G      D
3.0.0-pnfs+ #2 VMware, Inc. VMware Virtual Platform/440BX Desktop
Reference Platform
[19716.049011] RIP: 0010:[<ffffffff81084f49>]  [<ffffffff81084f49>]
do_raw_spin_lock+0x1e/0x25
[19716.049011] RSP: 0018:ffff88002a69be60  EFLAGS: 00000297
[19716.049011] RAX: 0000000000000005 RBX: ffff88002a59fd00 RCX: 0000000000000002
[19716.049011] RDX: 0000000000000004 RSI: ffff8800208c00f0 RDI: ffff8800208c1188
[19716.049011] RBP: ffff88002a69be60 R08: 0000000000000002 R09: 0000ffff00066c0a
[19716.049011] R10: 0000ffff00066c0a R11: ffff8800208c0000 R12: ffff88002a69bdf0
[19716.049011] R13: 0000000001ce15a2 R14: ffff88002a6f1f80 R15: ffff8800208c0000
[19716.049011] FS:  0000000000000000(0000) GS:ffff88003fd00000(0000)
knlGS:0000000000000000
[19716.049011] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[19716.049011] CR2: 00007fad5ac53000 CR3: 0000000038784000 CR4: 00000000000006e0
[19716.049011] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[19716.049011] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[19716.049011] Process 10.244.82.76-ma (pid: 29036, threadinfo
ffff88002a69a000, task ffff880022db9720)
[19716.049011] Stack:
[19716.049011]  ffff88002a69be70 ffffffff814585ee ffff88002a69be90
ffffffffa02be836
[19716.049011]  0000000000000002 ffff8800208c0000 ffff88002a69bee0
ffffffffa02bfd21
[19716.049011]  ffff880000000000 ffffffffa02d59c0 ffff88002a69bee0
ffff880037971ce8
[19716.049011] Call Trace:
[19716.049011]  [<ffffffff814585ee>] _raw_spin_lock+0xe/0x10
[19716.049011]  [<ffffffffa02be836>] nfs4_begin_drain_session+0x24/0x8f [nfs]
[19716.049011]  [<ffffffffa02bfd21>] nfs4_run_state_manager+0x271/0x517 [nfs]
[19716.049011]  [<ffffffffa02bfab0>] ? nfs4_do_reclaim+0x422/0x422 [nfs]
[19716.049011]  [<ffffffff810719bf>] kthread+0x84/0x8c
[19716.049011]  [<ffffffff81460f54>] kernel_thread_helper+0x4/0x10
[19716.049011]  [<ffffffff8107193b>] ? kthread_worker_fn+0x148/0x148
[19716.049011]  [<ffffffff81460f50>] ? gs_change+0x13/0x13
[19716.049011] Code: 00 00 10 00 74 05 e8 a7 59 1b 00 5d c3 55 48 89
e5 66 66 66 66 90 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2
74 07 f3 90 <0f> b7 17 eb f5 5d c3 55 48 89 e5 66 66 66 66 90 8b 07 89
c2 c1
[19716.049011] Call Trace:
[19716.049011] Call Trace:
[19716.049011]  [<ffffffff814585ee>] _raw_spin_lock+0xe/0x10
[19716.049011]  [<ffffffffa02be836>] nfs4_begin_drain_session+0x24/0x8f [nfs]
[19716.049011]  [<ffffffffa02bfd21>] nfs4_run_state_manager+0x271/0x517 [nfs]
[19716.049011]  [<ffffffffa02bfab0>] ? nfs4_do_reclaim+0x422/0x422 [nfs]
[19716.049011]  [<ffffffff810719bf>] kthread+0x84/0x8c
[19716.049011]  [<ffffffff81460f54>] kernel_thread_helper+0x4/0x10
[19716.049011]  [<ffffffff8107193b>] ? kthread_worker_fn+0x148/0x148
[19716.049011]  [<ffffffff81460f50>] ? gs_change+0x13/0x13

-- 
Thanks,
Tao

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-07-30 14:25             ` Peng Tao
@ 2011-08-01 21:10               ` Trond Myklebust
  2011-08-01 22:35                 ` Trond Myklebust
  0 siblings, 1 reply; 63+ messages in thread
From: Trond Myklebust @ 2011-08-01 21:10 UTC (permalink / raw)
  To: Peng Tao, William Andros Adamson
  Cc: Jim Rees, Christoph Hellwig, linux-nfs, peter honeyman

On Sat, 2011-07-30 at 22:25 +0800, Peng Tao wrote: 
> On Sat, Jul 30, 2011 at 11:26 AM, Jim Rees <rees@umich.edu> wrote:
> > Trond Myklebust wrote:
> >
> >  >   Looks like we did find a bug in NFS.
> >  >
> >  > It kind of looks that way.
> >
> >  Is that reproducible on the upstream kernel, or is it something that is
> >  being introduced by the pNFS blocks code?
> >
> > It happens without the blocks module loaded, but it could be from something
> > we did outside the module.  I will test this weekend when I get a chance.
> I tried xfstests again and was able to reproduce a hang on both block
> layout and file layout (upstream commit ed1e62, w/o block layout
> code). It seems it is a bug in pnfs code. I did not see it w/ NFSv4.
> For pnfs block and file layout, it can be reproduced by just running
> xfstests with ./check -nfs. It does not show up every time but is
> likely to happen in less than 10 runs.  Not sure if it is the same one
> Jim reported though.
> 
> block layout trace:
> [  660.039009] BUG: soft lockup - CPU#1 stuck for 22s! [10.244.82.74-ma:29730]
> [  660.039014] Modules linked in: blocklayoutdriver nfs lockd fscache
> auth_rpcgss nfs_acl ebtable_nat ebtables ipt_MASQUERADE iptable_nat
> nf_nat xt_CHECKSUM iptable_mangle bridge stp llc sunrpc be2iscsi
> ip6t_REJECT iscsi_boot_sysfs nf_conntrack_ipv6 nf_defrag_ipv6 bnx2i
> ip6table_filter cnic uio ip6_tables cxgb3i libcxgbi cxgb3 mdio
> iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev i2c_piix4
> i2c_core pcspkr e1000 parport_pc microcode parport vmw_balloon shpchp
> ipv6 floppy mptspi mptscsih mptbase scsi_transport_spi [last unloaded:
> nfs]
> [  660.039014] CPU 1
> [  660.039014] Modules linked in: blocklayoutdriver nfs lockd fscache
> auth_rpcgss nfs_acl ebtable_nat ebtables ipt_MASQUERADE iptable_nat
> nf_nat xt_CHECKSUM iptable_mangle bridge stp llc sunrpc be2iscsi
> ip6t_REJECT iscsi_boot_sysfs nf_conntrack_ipv6 nf_defrag_ipv6 bnx2i
> ip6table_filter cnic uio ip6_tables cxgb3i libcxgbi cxgb3 mdio
> iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev i2c_piix4
> i2c_core pcspkr e1000 parport_pc microcode parport vmw_balloon shpchp
> ipv6 floppy mptspi mptscsih mptbase scsi_transport_spi [last unloaded:
> nfs]
> [  660.039014]
> [  660.039014] Pid: 29730, comm: 10.244.82.74-ma Tainted: G      D
> 3.0.0-pnfs+ #2 VMware, Inc. VMware Virtual Platform/440BX Desktop
> Reference Platform
> [  660.039014] RIP: 0010:[<ffffffff81084f49>]  [<ffffffff81084f49>]
> do_raw_spin_lock+0x1e/0x25
> [  660.039014] RSP: 0018:ffff88001fef5e60  EFLAGS: 00000297
> [  660.039014] RAX: 000000000000002b RBX: ffff88003be19000 RCX: 0000000000000001
> [  660.039014] RDX: 000000000000002a RSI: ffff8800219a7cf0 RDI: ffff880020e4d988
> [  660.039014] RBP: ffff88001fef5e60 R08: 0000000000000000 R09: 000000000000df20
> [  660.039014] R10: 0000000000000000 R11: ffff8800219a7c00 R12: ffff88001fef5df0
> [  660.039014] R13: 00000000c355df1b R14: ffff88003bfaeac0 R15: ffff8800219a7c00
> [  660.039014] FS:  0000000000000000(0000) GS:ffff88003fd00000(0000)
> knlGS:0000000000000000
> [  660.039014] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  660.039014] CR2: 00007fc6122a4000 CR3: 0000000001a04000 CR4: 00000000000006e0
> [  660.039014] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  660.039014] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  660.039014] Process 10.244.82.74-ma (pid: 29730, threadinfo
> ffff88001fef4000, task ffff88001fca8000)
> [  660.039014] Stack:
> [  660.039014]  ffff88001fef5e70 ffffffff814585ee ffff88001fef5e90
> ffffffffa02badee
> [  660.039014]  0000000000000000 ffff8800219a7c00 ffff88001fef5ee0
> ffffffffa02bc2d9
> [  660.039014]  ffff880000000000 ffffffffa02d2250 ffff88001fef5ee0
> ffff88002059ba10
> [  660.039014] Call Trace:
> [  660.039014]  [<ffffffff814585ee>] _raw_spin_lock+0xe/0x10
> [  660.039014]  [<ffffffffa02badee>] nfs4_begin_drain_session+0x24/0x8f [nfs]
> [  660.039014]  [<ffffffffa02bc2d9>] nfs4_run_state_manager+0x271/0x517 [nfs]
> [  660.039014]  [<ffffffffa02bc068>] ? nfs4_do_reclaim+0x422/0x422 [nfs]
> [  660.039014]  [<ffffffff810719bf>] kthread+0x84/0x8c
> [  660.039014]  [<ffffffff81460f54>] kernel_thread_helper+0x4/0x10
> [  660.039014]  [<ffffffff8107193b>] ? kthread_worker_fn+0x148/0x148
> [  660.039014]  [<ffffffff81460f50>] ? gs_change+0x13/0x13
> [  660.039014] Code: 00 00 10 00 74 05 e8 a7 59 1b 00 5d c3 55 48 89
> e5 66 66 66 66 90 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2
> 74 07 f3 90 <0f> b7 17 eb f5 5d c3 55 48 89 e5 66 66 66 66 90 8b 07 89
> c2 c1

OK...

Looking at the callback code, I see that if tbl->highest_used_slotid !=
0, then we BUG() while holding the backchannel's tbl->slot_tbl_lock
spinlock. That seems a likely candidate for the above hang.

Andy, how are we guaranteed that tbl->highest_used_slotid won't take
values other than 0, and why do we commit suicide when it does? As far
as I can see, there is no guarantee that we call nfs4_cb_take_slot() in
nfs4_callback_compound(), however we appear to unconditionally call
nfs4_cb_free_slot() provided there is a session.

The other strangeness would be the fact that there is nothing enforcing
the NFS4_SESSION_DRAINING flag. If the session is draining, then the
back-channel simply ignores that and goes ahead with processing the
callback. Is this to avoid deadlocks with the server returning
NFS4ERR_BACK_CHAN_BUSY when the client does a DESTROY_SESSION?
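
To make the imbalance described above concrete, here is a minimal
userspace sketch. It is not the kernel code: the struct and function
names are hypothetical stand-ins, and it assumes the take path increments
highest_used_slotid before checking it against 0 and that the free path
simply decrements it. The take function returns false at the point where
the kernel would BUG().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the backchannel slot table. */
struct bc_slot_table {
	int highest_used_slotid;	/* -1 means no slot is in use */
};

/*
 * Assumed take path: bump the counter, then expect exactly slot 0.
 * Returns false where the kernel would hit BUG().
 */
bool bc_take_slot(struct bc_slot_table *tbl)
{
	tbl->highest_used_slotid++;
	return tbl->highest_used_slotid == 0;
}

/* Assumed free path: unconditional decrement. */
void bc_free_slot(struct bc_slot_table *tbl)
{
	tbl->highest_used_slotid--;
}

/*
 * One callback compound. reaches_take says whether processing got far
 * enough to take the slot; the free happens either way, as described.
 */
bool bc_compound(struct bc_slot_table *tbl, bool reaches_take)
{
	bool ok = true;

	if (reaches_take)
		ok = bc_take_slot(tbl);
	bc_free_slot(tbl);
	return ok;
}
```

In this model, a single compound that frees without taking leaves the
counter at -2, so the next take lands on -1 rather than 0 and trips the
check.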

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-08-01 21:10               ` Trond Myklebust
@ 2011-08-01 22:35                 ` Trond Myklebust
  2011-08-01 22:57                   ` Andy Adamson
  2011-08-02  2:21                   ` [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
  0 siblings, 2 replies; 63+ messages in thread
From: Trond Myklebust @ 2011-08-01 22:35 UTC (permalink / raw)
  To: Peng Tao
  Cc: William Andros Adamson, Jim Rees, Christoph Hellwig, linux-nfs,
	peter honeyman

On Mon, 2011-08-01 at 17:10 -0400, Trond Myklebust wrote: 
> Looking at the callback code, I see that if tbl->highest_used_slotid !=
> 0, then we BUG() while holding the backchannel's tbl->slot_tbl_lock
> spinlock. That seems a likely candidate for the above hang.
> 
> Andy, how are we guaranteed that tbl->highest_used_slotid won't take
> values other than 0, and why do we commit suicide when it does? As far
> as I can see, there is no guarantee that we call nfs4_cb_take_slot() in
> nfs4_callback_compound(), however we appear to unconditionally call
> nfs4_cb_free_slot() provided there is a session.
> 
> The other strangeness would be the fact that there is nothing enforcing
> the NFS4_SESSION_DRAINING flag. If the session is draining, then the
> back-channel simply ignores that and goes ahead with processing the
> callback. Is this to avoid deadlocks with the server returning
> NFS4ERR_BACK_CHAN_BUSY when the client does a DESTROY_SESSION?

How about something like the following?

8<------------------------------------------------------------------------------- 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-08-01 22:35                 ` Trond Myklebust
@ 2011-08-01 22:57                   ` Andy Adamson
  2011-08-01 23:11                     ` Trond Myklebust
  2011-08-02  2:21                   ` [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
  1 sibling, 1 reply; 63+ messages in thread
From: Andy Adamson @ 2011-08-01 22:57 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Peng Tao, Jim Rees, Christoph Hellwig, linux-nfs, peter honeyman


On Aug 1, 2011, at 6:35 PM, Trond Myklebust wrote:

> On Mon, 2011-08-01 at 17:10 -0400, Trond Myklebust wrote: 
>> Looking at the callback code, I see that if tbl->highest_used_slotid !=
>> 0, then we BUG() while holding the backchannel's tbl->slot_tbl_lock
>> spinlock. That seems a likely candidate for the above hang.
>> 
>> Andy, how are we guaranteed that tbl->highest_used_slotid won't take
>> values other than 0, and why do we commit suicide when it does? As far
>> as I can see, there is no guarantee that we call nfs4_cb_take_slot() in
>> nfs4_callback_compound(), however we appear to unconditionally call
>> nfs4_cb_free_slot() provided there is a session.
>> 
>> The other strangeness would be the fact that there is nothing enforcing
>> the NFS4_SESSION_DRAINING flag. If the session is draining, then the
>> back-channel simply ignores that and goes ahead with processing the
>> callback. Is this to avoid deadlocks with the server returning
>> NFS4ERR_BACK_CHAN_BUSY when the client does a DESTROY_SESSION?


When NFS4_SESSION_DRAINING is set, the backchannel is drained first: the
drain waits for current processing to complete, which is signaled by
highest_used_slotid == -1. Any other backchannel requests that arrive
while NFS4_SESSION_DRAINING is set are still processed, but only the
CB_SEQUENCE operation, which returns NFS4ERR_DELAY. It does indeed
prevent the BACK_CHAN_BUSY deadlock.
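
As a rough illustration of the draining behaviour described above, here
is a small userspace sketch. The struct and function names are
hypothetical and locking is left out entirely; only the NFS4ERR_DELAY
value (10008) and the "-1 means idle" convention come from the protocol
and the discussion.

```c
#include <assert.h>
#include <stdbool.h>

#define NFS4_OK		0
#define NFS4ERR_DELAY	10008

/* Hypothetical, lock-free model of the backchannel state. */
struct bc_session {
	bool draining;			/* models NFS4_SESSION_DRAINING */
	int highest_used_slotid;	/* -1 means no callback in flight */
};

/* CB_SEQUENCE: refuse new work with NFS4ERR_DELAY while draining. */
int cb_sequence(struct bc_session *ses, int slotid)
{
	if (ses->draining)
		return NFS4ERR_DELAY;
	ses->highest_used_slotid = slotid;
	return NFS4_OK;
}

/* The drain is complete once nothing is in flight any more. */
bool bc_drain_complete(const struct bc_session *ses)
{
	return ses->draining && ses->highest_used_slotid == -1;
}
```

Because the callback is answered with NFS4ERR_DELAY rather than being
left unanswered, the server has no in-progress backchannel request to
report as busy.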

> 
> How about something like the following?

Looks good. Nice catch. One change below and a comment inline

-->Andy

> 
> 8<------------------------------------------------------------------------------- 
> From c0c499b0ca9d0af8cbdc29c40effba38475461d9 Mon Sep 17 00:00:00 2001
> From: Trond Myklebust <Trond.Myklebust@netapp.com>
> Date: Mon, 1 Aug 2011 18:31:12 -0400
> Subject: [PATCH] NFSv4.1: Fix the callback 'highest_used_slotid' behaviour
> 
> Currently, there is no guarantee that we will call nfs4_cb_take_slot() even
> though nfs4_callback_compound() will consistently call
> nfs4_cb_free_slot() provided the cb_process_state has set the 'clp' field.
> The result is that we can trigger the BUG_ON() upon the next call to
> nfs4_cb_take_slot().
> This patch fixes the above problem by using the slot id that was taken in
> the CB_SEQUENCE operation as a flag for whether or not we need to call
> nfs4_cb_free_slot().
> It also fixes an atomicity problem: we need to set tbl->highest_used_slotid
> atomically with the check for NFS4_SESSION_DRAINING, otherwise we end up
> racing with the various tests in nfs4_begin_drain_session().

But the code below doesn't do this - it locks the backchannel slot table to check the flag, but does not set highest_used_slotid atomically with this check.

> 
> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
> ---
> fs/nfs/callback.h      |    2 +-
> fs/nfs/callback_proc.c |   18 +++++++++++++-----
> fs/nfs/callback_xdr.c  |   24 +++++++-----------------
> 3 files changed, 21 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/nfs/callback.h b/fs/nfs/callback.h
> index b257383..07df5f1 100644
> --- a/fs/nfs/callback.h
> +++ b/fs/nfs/callback.h
> @@ -38,6 +38,7 @@ enum nfs4_callback_opnum {
> struct cb_process_state {
> 	__be32			drc_status;
> 	struct nfs_client	*clp;
> +	int			slotid;
> };
> 
> struct cb_compound_hdr_arg {
> @@ -166,7 +167,6 @@ extern unsigned nfs4_callback_layoutrecall(
> 	void *dummy, struct cb_process_state *cps);
> 
> extern void nfs4_check_drain_bc_complete(struct nfs4_session *ses);
> -extern void nfs4_cb_take_slot(struct nfs_client *clp);
> 
> struct cb_devicenotifyitem {
> 	uint32_t		cbd_notify_type;
> diff --git a/fs/nfs/callback_proc.c b/fs/nfs/callback_proc.c
> index 74780f9..1bd2c81 100644
> --- a/fs/nfs/callback_proc.c
> +++ b/fs/nfs/callback_proc.c
> @@ -348,7 +348,7 @@ validate_seqid(struct nfs4_slot_table *tbl, struct cb_sequenceargs * args)
> 	/* Normal */
> 	if (likely(args->csa_sequenceid == slot->seq_nr + 1)) {
> 		slot->seq_nr++;
> -		return htonl(NFS4_OK);
> +		goto out_ok;
> 	}
> 
> 	/* Replay */
> @@ -367,11 +367,14 @@ validate_seqid(struct nfs4_slot_table *tbl, struct cb_sequenceargs * args)
> 	/* Wraparound */
> 	if (args->csa_sequenceid == 1 && (slot->seq_nr + 1) == 0) {
> 		slot->seq_nr = 1;
> -		return htonl(NFS4_OK);
> +		goto out_ok;
> 	}
> 
> 	/* Misordered request */
> 	return htonl(NFS4ERR_SEQ_MISORDERED);
> +out_ok:
> +	tbl->highest_used_slotid = args->csa_slotid;
> +	return htonl(NFS4_OK);
> }
> 
> /*
> @@ -433,26 +436,32 @@ __be32 nfs4_callback_sequence(struct cb_sequenceargs *args,
> 			      struct cb_sequenceres *res,
> 			      struct cb_process_state *cps)
> {
> +	struct nfs4_slot_table *tbl;
> 	struct nfs_client *clp;
> 	int i;
> 	__be32 status = htonl(NFS4ERR_BADSESSION);
> 
> -	cps->clp = NULL;
> -
> 	clp = nfs4_find_client_sessionid(args->csa_addr, &args->csa_sessionid);
> 	if (clp == NULL)
> 		goto out;
> 
> +	tbl = &clp->cl_session->bc_slot_table;
> +
> +	spin_lock(&tbl->slot_tbl_lock);
> 	/* state manager is resetting the session */
> 	if (test_bit(NFS4_SESSION_DRAINING, &clp->cl_session->session_state)) {
> 		status = NFS4ERR_DELAY;
                                 
status = htonl(NFS4ERR_DELAY);


> +		spin_unlock(&tbl->slot_tbl_lock);
> 		goto out;
> 	}
> 
> 	status = validate_seqid(&clp->cl_session->bc_slot_table, args);
> +	spin_unlock(&tbl->slot_tbl_lock);
> 	if (status)
> 		goto out;
> 
> +	cps->slotid = args->csa_slotid;
> +
> 	/*
> 	 * Check for pending referring calls.  If a match is found, a
> 	 * related callback was received before the response to the original
> @@ -469,7 +478,6 @@ __be32 nfs4_callback_sequence(struct cb_sequenceargs *args,
> 	res->csr_slotid = args->csa_slotid;
> 	res->csr_highestslotid = NFS41_BC_MAX_CALLBACKS - 1;
> 	res->csr_target_highestslotid = NFS41_BC_MAX_CALLBACKS - 1;
> -	nfs4_cb_take_slot(clp);
> 
> out:
> 	cps->clp = clp; /* put in nfs4_callback_compound */
> diff --git a/fs/nfs/callback_xdr.c b/fs/nfs/callback_xdr.c
> index c6c86a7..918ad64 100644
> --- a/fs/nfs/callback_xdr.c
> +++ b/fs/nfs/callback_xdr.c
> @@ -754,26 +754,15 @@ static void nfs4_callback_free_slot(struct nfs4_session *session)
> 	 * Let the state manager know callback processing done.
> 	 * A single slot, so highest used slotid is either 0 or -1
> 	 */
> -	tbl->highest_used_slotid--;
> +	tbl->highest_used_slotid = -1;
> 	nfs4_check_drain_bc_complete(session);
> 	spin_unlock(&tbl->slot_tbl_lock);
> }
> 
> -static void nfs4_cb_free_slot(struct nfs_client *clp)
> +static void nfs4_cb_free_slot(struct cb_process_state *cps)
> {
> -	if (clp && clp->cl_session)
> -		nfs4_callback_free_slot(clp->cl_session);
> -}
> -
> -/* A single slot, so highest used slotid is either 0 or -1 */
> -void nfs4_cb_take_slot(struct nfs_client *clp)
> -{
> -	struct nfs4_slot_table *tbl = &clp->cl_session->bc_slot_table;
> -
> -	spin_lock(&tbl->slot_tbl_lock);
> -	tbl->highest_used_slotid++;
> -	BUG_ON(tbl->highest_used_slotid != 0);
> -	spin_unlock(&tbl->slot_tbl_lock);
> +	if (cps->slotid != -1)
> +		nfs4_callback_free_slot(cps->clp->cl_session);
> }
> 
> #else /* CONFIG_NFS_V4_1 */
> @@ -784,7 +773,7 @@ preprocess_nfs41_op(int nop, unsigned int op_nr, struct callback_op **op)
> 	return htonl(NFS4ERR_MINOR_VERS_MISMATCH);
> }
> 
> -static void nfs4_cb_free_slot(struct nfs_client *clp)
> +static void nfs4_cb_free_slot(struct cb_process_state *cps)
> {
> }
> #endif /* CONFIG_NFS_V4_1 */
> @@ -866,6 +855,7 @@ static __be32 nfs4_callback_compound(struct svc_rqst *rqstp, void *argp, void *r
> 	struct cb_process_state cps = {
> 		.drc_status = 0,
> 		.clp = NULL,
> +		.slotid = -1,
> 	};
> 	unsigned int nops = 0;
> 
> @@ -906,7 +896,7 @@ static __be32 nfs4_callback_compound(struct svc_rqst *rqstp, void *argp, void *r
> 
> 	*hdr_res.status = status;
> 	*hdr_res.nops = htonl(nops);
> -	nfs4_cb_free_slot(cps.clp);
> +	nfs4_cb_free_slot(&cps);
> 	nfs_put_client(cps.clp);
> 	dprintk("%s: done, status = %u\n", __func__, ntohl(status));
> 	return rpc_success;
> -- 
> 1.7.6
> 
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer
> 
> NetApp
> Trond.Myklebust@netapp.com
> www.netapp.com
> 


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-08-01 22:57                   ` Andy Adamson
@ 2011-08-01 23:11                     ` Trond Myklebust
  2011-08-02 17:30                       ` Trond Myklebust
  0 siblings, 1 reply; 63+ messages in thread
From: Trond Myklebust @ 2011-08-01 23:11 UTC (permalink / raw)
  To: Andy Adamson
  Cc: Peng Tao, Jim Rees, Christoph Hellwig, linux-nfs, peter honeyman

On Mon, 2011-08-01 at 18:57 -0400, Andy Adamson wrote: 
> On Aug 1, 2011, at 6:35 PM, Trond Myklebust wrote:
> 
> > On Mon, 2011-08-01 at 17:10 -0400, Trond Myklebust wrote: 
> >> Looking at the callback code, I see that if tbl->highest_used_slotid !=
> >> 0, then we BUG() while holding the backchannel's tbl->slot_tbl_lock
> >> spinlock. That seems a likely candidate for the above hang.
> >> 
> >> Andy, how we are guaranteed that tbl->highest_used_slotid won't take
> >> values other than 0, and why do we commit suicide when it does? As far
> >> as I can see, there is no guarantee that we call nfs4_cb_take_slot() in
> >> nfs4_callback_compound(), however we appear to unconditionally call
> >> nfs4_cb_free_slot() provided there is a session.
> >> 
> >> The other strangeness would be the fact that there is nothing enforcing
> >> the NFS4_SESSION_DRAINING flag. If the session is draining, then the
> >> back-channel simply ignores that and goes ahead with processing the
> >> callback. Is this to avoid deadlocks with the server returning
> >> NFS4ERR_BACK_CHAN_BUSY when the client does a DESTROY_SESSION?
> 
> 
> > When NFS4_SESSION_DRAINING is set, the backchannel is drained first: the state manager waits for current processing to complete, which is signaled by highest_used_slotid == -1. Any other backchannel requests that arrive while NFS4_SESSION_DRAINING is set are processed, but only as far as the CB_SEQUENCE operation, which returns NFS4ERR_DELAY.  This does indeed prevent the BACK_CHAN_BUSY deadlock.
> 
> > 
> > How about something like the following?
> 
> Looks good. Nice catch. One change below and a comment inline
> 
> -->Andy
> 
> > 
> > 8<------------------------------------------------------------------------------- 
> > From c0c499b0ca9d0af8cbdc29c40effba38475461d9 Mon Sep 17 00:00:00 2001
> > From: Trond Myklebust <Trond.Myklebust@netapp.com>
> > Date: Mon, 1 Aug 2011 18:31:12 -0400
> > Subject: [PATCH] NFSv4.1: Fix the callback 'highest_used_slotid' behaviour
> > 
> > Currently, there is no guarantee that we will call nfs4_cb_take_slot() even
> > though nfs4_callback_compound() will consistently call
> > nfs4_cb_free_slot() provided the cb_process_state has set the 'clp' field.
> > The result is that we can trigger the BUG_ON() upon the next call to
> > nfs4_cb_take_slot().
> > This patch fixes the above problem by using the slot id that was taken in
> > the CB_SEQUENCE operation as a flag for whether or not we need to call
> > nfs4_cb_free_slot().
> > It also fixes an atomicity problem: we need to set tbl->highest_used_slotid
> > atomically with the check for NFS4_SESSION_DRAINING, otherwise we end up
> > racing with the various tests in nfs4_begin_drain_session().
> 
> > But the code below doesn't do this - it locks the backchannel slot table to check the flag, but does not set highest_used_slotid atomically with this check.

It should. The patch ensures that we hold the backchannel slot table
lock across both the NFS4_SESSION_DRAINING test and the
validate_seqid(), and also ensures that the latter function sets
tbl->highest_used_slotid on success.

> > 
> > Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
> > ---
> > fs/nfs/callback.h      |    2 +-
> > fs/nfs/callback_proc.c |   18 +++++++++++++-----
> > fs/nfs/callback_xdr.c  |   24 +++++++-----------------
> > 3 files changed, 21 insertions(+), 23 deletions(-)
> > 
> > diff --git a/fs/nfs/callback.h b/fs/nfs/callback.h
> > index b257383..07df5f1 100644
> > --- a/fs/nfs/callback.h
> > +++ b/fs/nfs/callback.h
> > @@ -38,6 +38,7 @@ enum nfs4_callback_opnum {
> > struct cb_process_state {
> > 	__be32			drc_status;
> > 	struct nfs_client	*clp;
> > +	int			slotid;
> > };
> > 
> > struct cb_compound_hdr_arg {
> > @@ -166,7 +167,6 @@ extern unsigned nfs4_callback_layoutrecall(
> > 	void *dummy, struct cb_process_state *cps);
> > 
> > extern void nfs4_check_drain_bc_complete(struct nfs4_session *ses);
> > -extern void nfs4_cb_take_slot(struct nfs_client *clp);
> > 
> > struct cb_devicenotifyitem {
> > 	uint32_t		cbd_notify_type;
> > diff --git a/fs/nfs/callback_proc.c b/fs/nfs/callback_proc.c
> > index 74780f9..1bd2c81 100644
> > --- a/fs/nfs/callback_proc.c
> > +++ b/fs/nfs/callback_proc.c
> > @@ -348,7 +348,7 @@ validate_seqid(struct nfs4_slot_table *tbl, struct cb_sequenceargs * args)
> > 	/* Normal */
> > 	if (likely(args->csa_sequenceid == slot->seq_nr + 1)) {
> > 		slot->seq_nr++;
> > -		return htonl(NFS4_OK);
> > +		goto out_ok;
> > 	}
> > 
> > 	/* Replay */
> > @@ -367,11 +367,14 @@ validate_seqid(struct nfs4_slot_table *tbl, struct cb_sequenceargs * args)
> > 	/* Wraparound */
> > 	if (args->csa_sequenceid == 1 && (slot->seq_nr + 1) == 0) {
> > 		slot->seq_nr = 1;
> > -		return htonl(NFS4_OK);
> > +		goto out_ok;
> > 	}
> > 
> > 	/* Misordered request */
> > 	return htonl(NFS4ERR_SEQ_MISORDERED);
> > +out_ok:
> > +	tbl->highest_used_slotid = args->csa_slotid;
> > +	return htonl(NFS4_OK);
> > }
> > 
> > /*
> > @@ -433,26 +436,32 @@ __be32 nfs4_callback_sequence(struct cb_sequenceargs *args,
> > 			      struct cb_sequenceres *res,
> > 			      struct cb_process_state *cps)
> > {
> > +	struct nfs4_slot_table *tbl;
> > 	struct nfs_client *clp;
> > 	int i;
> > 	__be32 status = htonl(NFS4ERR_BADSESSION);
> > 
> > -	cps->clp = NULL;
> > -
> > 	clp = nfs4_find_client_sessionid(args->csa_addr, &args->csa_sessionid);
> > 	if (clp == NULL)
> > 		goto out;
> > 
> > +	tbl = &clp->cl_session->bc_slot_table;
> > +
> > +	spin_lock(&tbl->slot_tbl_lock);
> > 	/* state manager is resetting the session */
> > 	if (test_bit(NFS4_SESSION_DRAINING, &clp->cl_session->session_state)) {
> > 		status = NFS4ERR_DELAY;
>                                  
> status = htonl(NFS4ERR_DELAY);

Yep. I'll fix that too... 
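
For reference, a small userspace sketch of why the missing htonl()
matters: the status is a __be32, so the value has to be converted to
network byte order before it is stored. The helper name status_to_wire()
and the EXAMPLE_ macro are made up for illustration; 10008 is the value
the NFSv4 specifications assign to NFS4ERR_DELAY.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Numeric value of NFS4ERR_DELAY per the NFSv4 specifications. */
#define EXAMPLE_NFS4ERR_DELAY 10008u

/*
 * Hypothetical helper: convert a host-order status to network byte
 * order and copy the resulting four bytes into out[].
 */
static void status_to_wire(uint32_t host_status, unsigned char out[4])
{
	uint32_t be = htonl(host_status);

	memcpy(out, &be, sizeof(be));
}
```

Without the htonl(), a little-endian client would store the bytes in
the opposite order, and the peer would decode a different status code.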

> > +		spin_unlock(&tbl->slot_tbl_lock);
> > 		goto out;
> > 	}
> > 
> > 	status = validate_seqid(&clp->cl_session->bc_slot_table, args);
> > +	spin_unlock(&tbl->slot_tbl_lock);

This is what guarantees the atomicity. Even if
nfs4_begin_drain_session() sets NFS4_SESSION_DRAINING, we know that it
can't test the value of tbl->highest_used_slotid without first taking
the tbl->slot_tbl_lock, and we're holding that...
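
A rough userspace model of that argument, in case it helps. This is only
a sketch, not the kernel code: the struct and function names are
invented, a C11 atomic_flag stands in for the spinlock, and an
atomic_bool stands in for the NFS4_SESSION_DRAINING bit. The callback
side tests the flag and claims the slot inside one lock hold; the drain
side sets the flag first and then looks at highest_used_slotid under the
same lock.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the backchannel slot table. */
struct slot_table {
	atomic_flag lock;          /* models tbl->slot_tbl_lock */
	atomic_bool draining;      /* models NFS4_SESSION_DRAINING */
	int highest_used_slotid;   /* -1 means no slot is in use */
};

static void tbl_lock(struct slot_table *tbl)
{
	while (atomic_flag_test_and_set_explicit(&tbl->lock,
						 memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void tbl_unlock(struct slot_table *tbl)
{
	atomic_flag_clear_explicit(&tbl->lock, memory_order_release);
}

static void slot_table_init(struct slot_table *tbl)
{
	atomic_flag_clear(&tbl->lock);
	atomic_init(&tbl->draining, false);
	tbl->highest_used_slotid = -1;
}

/* Callback side: flag test and slot claim share one critical section. */
static int claim_slot(struct slot_table *tbl, int slotid)
{
	int ret = 0;

	tbl_lock(tbl);
	if (atomic_load(&tbl->draining))
		ret = -EAGAIN;	/* refuse new work while draining */
	else
		tbl->highest_used_slotid = slotid;
	tbl_unlock(tbl);
	return ret;
}

static void release_slot(struct slot_table *tbl)
{
	tbl_lock(tbl);
	tbl->highest_used_slotid = -1;
	tbl_unlock(tbl);
}

/* Drain side: set the flag, then check for idleness under the lock. */
static bool begin_drain(struct slot_table *tbl)
{
	bool idle;

	atomic_store(&tbl->draining, true);
	tbl_lock(tbl);
	idle = (tbl->highest_used_slotid == -1);
	tbl_unlock(tbl);
	return idle;
}
```

In this model, whichever side takes the lock second sees the other
side's update: either claim_slot() observes the flag and backs off, or
begin_drain() observes the claimed slot and reports "not idle".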

> > 	if (status)
> > 		goto out;
> > 
> > +	cps->slotid = args->csa_slotid;
> 

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-08-01 22:35                 ` Trond Myklebust
  2011-08-01 22:57                   ` Andy Adamson
@ 2011-08-02  2:21                   ` Jim Rees
  2011-08-02  2:29                     ` Myklebust, Trond
  1 sibling, 1 reply; 63+ messages in thread
From: Jim Rees @ 2011-08-02  2:21 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Peng Tao, William Andros Adamson, Christoph Hellwig, linux-nfs,
	peter honeyman

Trond Myklebust wrote:

  On Mon, 2011-08-01 at 17:10 -0400, Trond Myklebust wrote: 
  > Looking at the callback code, I see that if tbl->highest_used_slotid !=
  > 0, then we BUG() while holding the backchannel's tbl->slot_tbl_lock
  > spinlock. That seems a likely candidate for the above hang.
  > 
  > Andy, how we are guaranteed that tbl->highest_used_slotid won't take
  > values other than 0, and why do we commit suicide when it does? As far
  > as I can see, there is no guarantee that we call nfs4_cb_take_slot() in
  > nfs4_callback_compound(), however we appear to unconditionally call
  > nfs4_cb_free_slot() provided there is a session.
  > 
  > The other strangeness would be the fact that there is nothing enforcing
  > the NFS4_SESSION_DRAINING flag. If the session is draining, then the
  > back-channel simply ignores that and goes ahead with processing the
  > callback. Is this to avoid deadlocks with the server returning
  > NFS4ERR_BACK_CHAN_BUSY when the client does a DESTROY_SESSION?
  
  How about something like the following?

I applied this patch, along with Andy's htonl correction.  It now fails in a
different way, with a deadlock.  The test runs several processes in
parallel.

INFO: task t_mtab:1767 blocked for more than 10 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
t_mtab          D 0000000000000000     0  1767   1634 0x00000080
 ffff8800376afd48 0000000000000086 ffff8800376afcd8 ffffffff00000000
 ffff8800376ae010 ffff880037ef4500 0000000000012c80 ffff8800376affd8
 ffff8800376affd8 0000000000012c80 ffffffff81a0c020 ffff880037ef4500
Call Trace:
 [<ffffffff8145411a>] __mutex_lock_common+0x110/0x171
 [<ffffffff81454191>] __mutex_lock_slowpath+0x16/0x18
 [<ffffffff81454257>] mutex_lock+0x1e/0x32
 [<ffffffff811169a2>] kern_path_create+0x75/0x11e
 [<ffffffff810fe836>] ? kmem_cache_alloc+0x5f/0xf1
 [<ffffffff812127d9>] ? strncpy_from_user+0x43/0x72
 [<ffffffff81114077>] ? getname_flags+0x158/0x1d2
 [<ffffffff81116a86>] user_path_create+0x3b/0x52
 [<ffffffff81117466>] sys_linkat+0x9a/0x120
 [<ffffffff8109932e>] ? audit_syscall_entry+0x119/0x145
 [<ffffffff81117505>] sys_link+0x19/0x1c
 [<ffffffff8145b612>] system_call_fastpath+0x16/0x1b
INFO: task t_mtab:1768 blocked for more than 10 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
t_mtab          D 0000000000000000     0  1768   1634 0x00000080
 ffff880037ccbc18 0000000000000082 ffff880037ccbbe8 ffffffff00000000
 ffff880037cca010 ffff880037ef2e00 0000000000012c80 ffff880037ccbfd8
 ffff880037ccbfd8 0000000000012c80 ffffffff81a0c020 ffff880037ef2e00
Call Trace:
 [<ffffffff8145411a>] __mutex_lock_common+0x110/0x171
 [<ffffffff81454191>] __mutex_lock_slowpath+0x16/0x18
 [<ffffffff81454257>] mutex_lock+0x1e/0x32
 [<ffffffff8111565d>] ? walk_component+0x362/0x38f
 [<ffffffff811e7b9a>] ima_file_check+0x53/0x111
 [<ffffffff81115ae0>] do_last+0x456/0x566
 [<ffffffff81114467>] ? path_init+0x179/0x2b8
 [<ffffffff81116148>] path_openat+0xca/0x30e
 [<ffffffff8111647b>] do_filp_open+0x38/0x84
 [<ffffffff812127d9>] ? strncpy_from_user+0x43/0x72
 [<ffffffff81120014>] ? alloc_fd+0x76/0x11f
 [<ffffffff81109696>] do_sys_open+0x6e/0x100
 [<ffffffff81109751>] sys_open+0x1b/0x1d
 [<ffffffff8145b612>] system_call_fastpath+0x16/0x1b
INFO: task t_mtab:1767 blocked for more than 10 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
t_mtab          D 0000000000000000     0  1767   1634 0x00000080
 ffff8800376afc18 0000000000000086 ffff8800376afbe8 ffffffff00000000
 ffff8800376ae010 ffff880037ef4500 0000000000012c80 ffff8800376affd8
 ffff8800376affd8 0000000000012c80 ffffffff81a0c020 ffff880037ef4500
Call Trace:
 [<ffffffff8145411a>] __mutex_lock_common+0x110/0x171
 [<ffffffff81454191>] __mutex_lock_slowpath+0x16/0x18
 [<ffffffff81454257>] mutex_lock+0x1e/0x32
 [<ffffffff8111565d>] ? walk_component+0x362/0x38f
 [<ffffffff811e7b9a>] ima_file_check+0x53/0x111
 [<ffffffff81115ae0>] do_last+0x456/0x566
 [<ffffffff81114467>] ? path_init+0x179/0x2b8
 [<ffffffff81116148>] path_openat+0xca/0x30e
 [<ffffffff8111647b>] do_filp_open+0x38/0x84
 [<ffffffff812127d9>] ? strncpy_from_user+0x43/0x72
 [<ffffffff81120014>] ? alloc_fd+0x76/0x11f
 [<ffffffff81109696>] do_sys_open+0x6e/0x100
 [<ffffffff81109751>] sys_open+0x1b/0x1d
 [<ffffffff8145b612>] system_call_fastpath+0x16/0x1b
INFO: task t_mtab:1768 blocked for more than 10 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
t_mtab          D 0000000000000000     0  1768   1634 0x00000080
 ffff880037ccbc68 0000000000000082 0000000000000000 0000000000000000
 ffff880037cca010 ffff880037ef2e00 0000000000012c80 ffff880037ccbfd8
 ffff880037ccbfd8 0000000000012c80 ffffffff81a0c020 ffff880037ef2e00
Call Trace:
 [<ffffffff8145411a>] __mutex_lock_common+0x110/0x171
 [<ffffffffa0267ca5>] ? nfs_permission+0xd7/0x168 [nfs]
 [<ffffffff81454191>] __mutex_lock_slowpath+0x16/0x18
 [<ffffffff81454257>] mutex_lock+0x1e/0x32
 [<ffffffff8111581e>] do_last+0x194/0x566
 [<ffffffff81114467>] ? path_init+0x179/0x2b8
 [<ffffffff81116148>] path_openat+0xca/0x30e
 [<ffffffffa028d8fd>] ? __nfs4_close+0xf4/0x101 [nfs]
 [<ffffffff8111647b>] do_filp_open+0x38/0x84
 [<ffffffff812127d9>] ? strncpy_from_user+0x43/0x72
 [<ffffffff81120014>] ? alloc_fd+0x76/0x11f
 [<ffffffff81109696>] do_sys_open+0x6e/0x100
 [<ffffffff81109751>] sys_open+0x1b/0x1d
 [<ffffffff8145b612>] system_call_fastpath+0x16/0x1b
INFO: task t_mtab:1767 blocked for more than 10 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
t_mtab          D 0000000000000000     0  1767   1634 0x00000080
 ffff8800376afd48 0000000000000086 ffff8800376afcd8 ffffffff00000000
 ffff8800376ae010 ffff880037ef4500 0000000000012c80 ffff8800376affd8
 ffff8800376affd8 0000000000012c80 ffffffff81a0c020 ffff880037ef4500
Call Trace:
 [<ffffffff8145411a>] __mutex_lock_common+0x110/0x171
 [<ffffffff81454191>] __mutex_lock_slowpath+0x16/0x18
 [<ffffffff81454257>] mutex_lock+0x1e/0x32
 [<ffffffff811169a2>] kern_path_create+0x75/0x11e
 [<ffffffff810fe836>] ? kmem_cache_alloc+0x5f/0xf1
 [<ffffffff812127d9>] ? strncpy_from_user+0x43/0x72
 [<ffffffff81114077>] ? getname_flags+0x158/0x1d2
 [<ffffffff81116a86>] user_path_create+0x3b/0x52
 [<ffffffff81117466>] sys_linkat+0x9a/0x120
 [<ffffffff8109932e>] ? audit_syscall_entry+0x119/0x145
 [<ffffffff81117505>] sys_link+0x19/0x1c
 [<ffffffff8145b612>] system_call_fastpath+0x16/0x1b
INFO: task t_mtab:1769 blocked for more than 10 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
t_mtab          D 0000000000000000     0  1769   1634 0x00000080
 ffff88006c2d1c18 0000000000000082 ffff88006c2d1be8 ffffffff00000000
 ffff88006c2d0010 ffff880037ef0000 0000000000012c80 ffff88006c2d1fd8
 ffff88006c2d1fd8 0000000000012c80 ffffffff81a0c020 ffff880037ef0000
Call Trace:
 [<ffffffff8145411a>] __mutex_lock_common+0x110/0x171
 [<ffffffff81454191>] __mutex_lock_slowpath+0x16/0x18
 [<ffffffff81454257>] mutex_lock+0x1e/0x32
 [<ffffffff8111565d>] ? walk_component+0x362/0x38f
 [<ffffffff811e7b9a>] ima_file_check+0x53/0x111
 [<ffffffff81115ae0>] do_last+0x456/0x566
 [<ffffffff81114467>] ? path_init+0x179/0x2b8
 [<ffffffff81116148>] path_openat+0xca/0x30e
 [<ffffffff8111647b>] do_filp_open+0x38/0x84
 [<ffffffff812127d9>] ? strncpy_from_user+0x43/0x72
 [<ffffffff81120014>] ? alloc_fd+0x76/0x11f
 [<ffffffff81109696>] do_sys_open+0x6e/0x100
 [<ffffffff81109751>] sys_open+0x1b/0x1d
 [<ffffffff8145b612>] system_call_fastpath+0x16/0x1b

^ permalink raw reply	[flat|nested] 63+ messages in thread

* RE: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-08-02  2:21                   ` [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
@ 2011-08-02  2:29                     ` Myklebust, Trond
  2011-08-02  3:23                       ` Jim Rees
  0 siblings, 1 reply; 63+ messages in thread
From: Myklebust, Trond @ 2011-08-02  2:29 UTC (permalink / raw)
  To: Jim Rees
  Cc: Peng Tao, Adamson, Andy, Christoph Hellwig, linux-nfs, peter honeyman

> -----Original Message-----
> From: Jim Rees [mailto:rees@umich.edu]
> Sent: Monday, August 01, 2011 10:22 PM
> To: Myklebust, Trond
> Cc: Peng Tao; Adamson, Andy; Christoph Hellwig; linux-nfs@vger.kernel.org;
> peter honeyman
> Subject: Re: [PATCH v4 00/27] add block layout driver to pnfs client
> 
> Trond Myklebust wrote:
> 
>   On Mon, 2011-08-01 at 17:10 -0400, Trond Myklebust wrote:
>   > Looking at the callback code, I see that if tbl->highest_used_slotid !=
>   > 0, then we BUG() while holding the backchannel's tbl->slot_tbl_lock
>   > spinlock. That seems a likely candidate for the above hang.
>   >
>   > Andy, how we are guaranteed that tbl->highest_used_slotid won't take
>   > values other than 0, and why do we commit suicide when it does? As far
>   > as I can see, there is no guarantee that we call nfs4_cb_take_slot() in
>   > nfs4_callback_compound(), however we appear to unconditionally call
>   > nfs4_cb_free_slot() provided there is a session.
>   >
>   > The other strangeness would be the fact that there is nothing enforcing
>   > the NFS4_SESSION_DRAINING flag. If the session is draining, then the
>   > back-channel simply ignores that and goes ahead with processing the
>   > callback. Is this to avoid deadlocks with the server returning
>   > NFS4ERR_BACK_CHAN_BUSY when the client does a DESTROY_SESSION?
> 
>   How about something like the following?
> 
> I applied this patch, along with Andy's htonl correction.  It now fails in a
> different way, with a deadlock.  The test runs several processes in
> parallel.
> 
> INFO: task t_mtab:1767 blocked for more than 10 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> t_mtab          D 0000000000000000     0  1767   1634 0x00000080
>  ffff8800376afd48 0000000000000086 ffff8800376afcd8 ffffffff00000000
>  ffff8800376ae010 ffff880037ef4500 0000000000012c80 ffff8800376affd8
>  ffff8800376affd8 0000000000012c80 ffffffff81a0c020 ffff880037ef4500
> Call Trace:
>  [<ffffffff8145411a>] __mutex_lock_common+0x110/0x171
>  [<ffffffff81454191>] __mutex_lock_slowpath+0x16/0x18
>  [<ffffffff81454257>] mutex_lock+0x1e/0x32
>  [<ffffffff811169a2>] kern_path_create+0x75/0x11e
>  [<ffffffff810fe836>] ? kmem_cache_alloc+0x5f/0xf1
>  [<ffffffff812127d9>] ? strncpy_from_user+0x43/0x72
>  [<ffffffff81114077>] ? getname_flags+0x158/0x1d2
>  [<ffffffff81116a86>] user_path_create+0x3b/0x52
>  [<ffffffff81117466>] sys_linkat+0x9a/0x120
>  [<ffffffff8109932e>] ? audit_syscall_entry+0x119/0x145
>  [<ffffffff81117505>] sys_link+0x19/0x1c
>  [<ffffffff8145b612>] system_call_fastpath+0x16/0x1b

That's a different issue. If you do an 'echo t >/proc/sysrq-trigger',
then do you see any other process that is stuck in the nfs layer and
that might be holding the inode->i_mutex?



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-08-02  2:29                     ` Myklebust, Trond
@ 2011-08-02  3:23                       ` Jim Rees
  2011-08-02 12:28                         ` Trond Myklebust
  0 siblings, 1 reply; 63+ messages in thread
From: Jim Rees @ 2011-08-02  3:23 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: Peng Tao, Adamson, Andy, Christoph Hellwig, linux-nfs, peter honeyman

Myklebust, Trond wrote:

  That's a different issue. If you do an 'echo t >/proc/sysrq-trigger',
  then do you see any other process that is stuck in the nfs layer and
  that might be holding the inode->i_mutex?

Hard to tell.  There are a couple of possibilities.  I've put the console
output here, and will investigate more in the morning:

http://www.citi.umich.edu/projects/nfsv4/pnfs/block/download/console.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-08-02  3:23                       ` Jim Rees
@ 2011-08-02 12:28                         ` Trond Myklebust
  2011-08-02 12:56                           ` Jim Rees
  2011-08-03  1:48                           ` Jim Rees
  0 siblings, 2 replies; 63+ messages in thread
From: Trond Myklebust @ 2011-08-02 12:28 UTC (permalink / raw)
  To: Jim Rees
  Cc: Peng Tao, Adamson, Andy, Christoph Hellwig, linux-nfs, peter honeyman

On Mon, 2011-08-01 at 23:23 -0400, Jim Rees wrote: 
> Myklebust, Trond wrote:
> 
>   That's a different issue. If you do an 'echo t >/proc/sysrq-trigger',
>   then do you see any other process that is stuck in the nfs layer and
>   that might be holding the inode->i_mutex?
> 
> Hard to tell.  There are a couple of possibilities.  I've put the console
> output here, and will investigate more in the morning:
> 
> http://www.citi.umich.edu/projects/nfsv4/pnfs/block/download/console.txt


Hmm... You don't seem to have all the mutex debugging options enabled,
but I strongly suspect that this thread is the culprit.


> t_mtab          D 0000000000000000     0  1833   1698 0x00000080
>  ffff880072f73ac8 0000000000000086 ffff880072f73a98 ffffffff00000000
>  ffff880072f72010 ffff8800698fdc00 0000000000012c80 ffff880072f73fd8
>  ffff880072f73fd8 0000000000012c80 ffffffff81a0c020 ffff8800698fdc00
> Call Trace:
>  [<ffffffffa01f5a52>] ? rpc_queue_empty+0x29/0x29 [sunrpc]
>  [<ffffffffa01f5a81>] rpc_wait_bit_killable+0x2f/0x33 [sunrpc]
>  [<ffffffff81453eec>] __wait_on_bit+0x43/0x76
>  [<ffffffff81453f88>] out_of_line_wait_on_bit+0x69/0x74
>  [<ffffffffa01f5a52>] ? rpc_queue_empty+0x29/0x29 [sunrpc]
>  [<ffffffff81069213>] ? autoremove_wake_function+0x38/0x38
>  [<ffffffffa01f5f0a>] __rpc_execute+0xed/0x249 [sunrpc]
>  [<ffffffffa01f60a3>] rpc_execute+0x3d/0x42 [sunrpc]
>  [<ffffffffa01efbca>] rpc_run_task+0x79/0x81 [sunrpc]
>  [<ffffffffa028c3b7>] nfs4_call_sync_sequence+0x60/0x81 [nfs]
>  [<ffffffff810fe4af>] ? kmem_cache_alloc_trace+0xab/0xbd
>  [<ffffffffa028c55e>] _nfs4_call_sync_session+0x14/0x16 [nfs]
>  [<ffffffffa0279b7d>] ? nfs_alloc_fattr+0x24/0x57 [nfs]
>  [<ffffffffa028ecd1>] _nfs4_proc_remove+0xcd/0x110 [nfs]
>  [<ffffffffa028ed43>] nfs4_proc_remove+0x2f/0x55 [nfs]
>  [<ffffffffa027524f>] nfs_unlink+0xf5/0x1a3 [nfs]
>  [<ffffffff81114c66>] vfs_unlink+0x70/0xd1
>  [<ffffffff8111714d>] do_unlinkat+0xd5/0x161
>  [<ffffffff8112247e>] ? mntput+0x21/0x23
>  [<ffffffff8111381d>] ? path_put+0x1d/0x21
>  [<ffffffff8109932e>] ? audit_syscall_entry+0x119/0x145
>  [<ffffffff811171ea>] sys_unlink+0x11/0x13
>  [<ffffffff8145b612>] system_call_fastpath+0x16/0x1b

Any idea which file that thread is deleting, and why that might be
hanging?

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-08-02 12:28                         ` Trond Myklebust
@ 2011-08-02 12:56                           ` Jim Rees
  2011-08-03  1:48                           ` Jim Rees
  1 sibling, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-08-02 12:56 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Peng Tao, Adamson, Andy, Christoph Hellwig, linux-nfs, peter honeyman

Trond Myklebust wrote:

  Any idea which file that thread is deleting, and why that might be
  hanging?

The test is a copy of the part of the mount command that writes /etc/mtab.
So it's doing some sequence of link, write, flock.  It does this thousands
of times, much of it in parallel.  I'll get a trace of a single instance of
this so we can see what the actual syscalls are.  I'll also see about
turning on the mutex debug.  Give me a couple of hours.

If you want to look at the source code for the test, clone this:
git://oss.sgi.com/xfs/cmds/xfstests

and it's in src/t_mtab.c.  It's driven by the "089" script.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-08-01 23:11                     ` Trond Myklebust
@ 2011-08-02 17:30                       ` Trond Myklebust
  2011-08-02 18:50                         ` [PATCH v2 1/2] NFSv4.1: Fix the callback 'highest_used_slotid' behaviour Trond Myklebust
  0 siblings, 1 reply; 63+ messages in thread
From: Trond Myklebust @ 2011-08-02 17:30 UTC (permalink / raw)
  To: Andy Adamson
  Cc: Peng Tao, Jim Rees, Christoph Hellwig, linux-nfs, peter honeyman

On Mon, 2011-08-01 at 19:11 -0400, Trond Myklebust wrote: 
> On Mon, 2011-08-01 at 18:57 -0400, Andy Adamson wrote: 
> > On Aug 1, 2011, at 6:35 PM, Trond Myklebust wrote:
> > > /*
> > > @@ -433,26 +436,32 @@ __be32 nfs4_callback_sequence(struct cb_sequenceargs *args,
> > > 			      struct cb_sequenceres *res,
> > > 			      struct cb_process_state *cps)
> > > {
> > > +	struct nfs4_slot_table *tbl;
> > > 	struct nfs_client *clp;
> > > 	int i;
> > > 	__be32 status = htonl(NFS4ERR_BADSESSION);
> > > 
> > > -	cps->clp = NULL;
> > > -
> > > 	clp = nfs4_find_client_sessionid(args->csa_addr, &args->csa_sessionid);
> > > 	if (clp == NULL)
> > > 		goto out;
> > > 
> > > +	tbl = &clp->cl_session->bc_slot_table;
> > > +
> > > +	spin_lock(&tbl->slot_tbl_lock);
> > > 	/* state manager is resetting the session */
> > > 	if (test_bit(NFS4_SESSION_DRAINING, &clp->cl_session->session_state)) {
> > > 		status = NFS4ERR_DELAY;
> >                                  
> > status = htonl(NFS4ERR_DELAY);
> 
> Yep. I'll fix that too... 

Looking at this again, I'm not sure that the above is safe. If you
return NFS4ERR_DELAY on the SEQUENCE operation, then that has a very
specific meaning of "I'm executing your command, but it will take a
while to complete.".
It therefore seems to me that if we return NFS4ERR_DELAY, then we are
basically setting ourselves up for a deadlock: 
      * On the one hand, we're trying to destroy the session 
      * On the other hand, we're telling the server that 'your callback
        is being processed'.
In that situation, it seems to me that it should be perfectly valid for
the server to reject our DESTROY_SESSION call with a
NFS4ERR_BACK_CHAN_BUSY until we process the callback.

Comments?

Cheers
  Trond

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH v2 1/2] NFSv4.1: Fix the callback 'highest_used_slotid' behaviour
  2011-08-02 17:30                       ` Trond Myklebust
@ 2011-08-02 18:50                         ` Trond Myklebust
  2011-08-02 18:50                           ` [PATCH v2 2/2] NFSv4.1: Return NFS4ERR_BADSESSION to callbacks during session resets Trond Myklebust
  2011-08-03  8:52                           ` [PATCH v2 1/2] NFSv4.1: Fix the callback 'highest_used_slotid' behaviour Peng Tao
  0 siblings, 2 replies; 63+ messages in thread
From: Trond Myklebust @ 2011-08-02 18:50 UTC (permalink / raw)
  To: Andy Adamson
  Cc: Peng Tao, Jim Rees, Christoph Hellwig, linux-nfs, peter honeyman

Currently, there is no guarantee that we will call nfs4_cb_take_slot() even
though nfs4_callback_compound() will consistently call
nfs4_cb_free_slot() provided the cb_process_state has set the 'clp' field.
The result is that we can trigger the BUG_ON() upon the next call to
nfs4_cb_take_slot().

This patch fixes the above problem by using the slot id that was taken in
the CB_SEQUENCE operation as a flag for whether or not we need to call
nfs4_cb_free_slot().
It also fixes an atomicity problem: we need to set tbl->highest_used_slotid
atomically with the check for NFS4_SESSION_DRAINING, otherwise we end up
racing with the various tests in nfs4_begin_drain_session().

Cc: stable@kernel.org [2.6.38+]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
 fs/nfs/callback.h      |    2 +-
 fs/nfs/callback_proc.c |   20 ++++++++++++++------
 fs/nfs/callback_xdr.c  |   24 +++++++-----------------
 3 files changed, 22 insertions(+), 24 deletions(-)

diff --git a/fs/nfs/callback.h b/fs/nfs/callback.h
index b257383..07df5f1 100644
--- a/fs/nfs/callback.h
+++ b/fs/nfs/callback.h
@@ -38,6 +38,7 @@ enum nfs4_callback_opnum {
 struct cb_process_state {
 	__be32			drc_status;
 	struct nfs_client	*clp;
+	int			slotid;
 };
 
 struct cb_compound_hdr_arg {
@@ -166,7 +167,6 @@ extern unsigned nfs4_callback_layoutrecall(
 	void *dummy, struct cb_process_state *cps);
 
 extern void nfs4_check_drain_bc_complete(struct nfs4_session *ses);
-extern void nfs4_cb_take_slot(struct nfs_client *clp);
 
 struct cb_devicenotifyitem {
 	uint32_t		cbd_notify_type;
diff --git a/fs/nfs/callback_proc.c b/fs/nfs/callback_proc.c
index 74780f9..0ab8202 100644
--- a/fs/nfs/callback_proc.c
+++ b/fs/nfs/callback_proc.c
@@ -348,7 +348,7 @@ validate_seqid(struct nfs4_slot_table *tbl, struct cb_sequenceargs * args)
 	/* Normal */
 	if (likely(args->csa_sequenceid == slot->seq_nr + 1)) {
 		slot->seq_nr++;
-		return htonl(NFS4_OK);
+		goto out_ok;
 	}
 
 	/* Replay */
@@ -367,11 +367,14 @@ validate_seqid(struct nfs4_slot_table *tbl, struct cb_sequenceargs * args)
 	/* Wraparound */
 	if (args->csa_sequenceid == 1 && (slot->seq_nr + 1) == 0) {
 		slot->seq_nr = 1;
-		return htonl(NFS4_OK);
+		goto out_ok;
 	}
 
 	/* Misordered request */
 	return htonl(NFS4ERR_SEQ_MISORDERED);
+out_ok:
+	tbl->highest_used_slotid = args->csa_slotid;
+	return htonl(NFS4_OK);
 }
 
 /*
@@ -433,26 +436,32 @@ __be32 nfs4_callback_sequence(struct cb_sequenceargs *args,
 			      struct cb_sequenceres *res,
 			      struct cb_process_state *cps)
 {
+	struct nfs4_slot_table *tbl;
 	struct nfs_client *clp;
 	int i;
 	__be32 status = htonl(NFS4ERR_BADSESSION);
 
-	cps->clp = NULL;
-
 	clp = nfs4_find_client_sessionid(args->csa_addr, &args->csa_sessionid);
 	if (clp == NULL)
 		goto out;
 
+	tbl = &clp->cl_session->bc_slot_table;
+
+	spin_lock(&tbl->slot_tbl_lock);
 	/* state manager is resetting the session */
 	if (test_bit(NFS4_SESSION_DRAINING, &clp->cl_session->session_state)) {
-		status = NFS4ERR_DELAY;
+		spin_unlock(&tbl->slot_tbl_lock);
+		status = htonl(NFS4ERR_DELAY);
 		goto out;
 	}
 
 	status = validate_seqid(&clp->cl_session->bc_slot_table, args);
+	spin_unlock(&tbl->slot_tbl_lock);
 	if (status)
 		goto out;
 
+	cps->slotid = args->csa_slotid;
+
 	/*
 	 * Check for pending referring calls.  If a match is found, a
 	 * related callback was received before the response to the original
@@ -469,7 +478,6 @@ __be32 nfs4_callback_sequence(struct cb_sequenceargs *args,
 	res->csr_slotid = args->csa_slotid;
 	res->csr_highestslotid = NFS41_BC_MAX_CALLBACKS - 1;
 	res->csr_target_highestslotid = NFS41_BC_MAX_CALLBACKS - 1;
-	nfs4_cb_take_slot(clp);
 
 out:
 	cps->clp = clp; /* put in nfs4_callback_compound */
diff --git a/fs/nfs/callback_xdr.c b/fs/nfs/callback_xdr.c
index c6c86a7..918ad64 100644
--- a/fs/nfs/callback_xdr.c
+++ b/fs/nfs/callback_xdr.c
@@ -754,26 +754,15 @@ static void nfs4_callback_free_slot(struct nfs4_session *session)
 	 * Let the state manager know callback processing done.
 	 * A single slot, so highest used slotid is either 0 or -1
 	 */
-	tbl->highest_used_slotid--;
+	tbl->highest_used_slotid = -1;
 	nfs4_check_drain_bc_complete(session);
 	spin_unlock(&tbl->slot_tbl_lock);
 }
 
-static void nfs4_cb_free_slot(struct nfs_client *clp)
+static void nfs4_cb_free_slot(struct cb_process_state *cps)
 {
-	if (clp && clp->cl_session)
-		nfs4_callback_free_slot(clp->cl_session);
-}
-
-/* A single slot, so highest used slotid is either 0 or -1 */
-void nfs4_cb_take_slot(struct nfs_client *clp)
-{
-	struct nfs4_slot_table *tbl = &clp->cl_session->bc_slot_table;
-
-	spin_lock(&tbl->slot_tbl_lock);
-	tbl->highest_used_slotid++;
-	BUG_ON(tbl->highest_used_slotid != 0);
-	spin_unlock(&tbl->slot_tbl_lock);
+	if (cps->slotid != -1)
+		nfs4_callback_free_slot(cps->clp->cl_session);
 }
 
 #else /* CONFIG_NFS_V4_1 */
@@ -784,7 +773,7 @@ preprocess_nfs41_op(int nop, unsigned int op_nr, struct callback_op **op)
 	return htonl(NFS4ERR_MINOR_VERS_MISMATCH);
 }
 
-static void nfs4_cb_free_slot(struct nfs_client *clp)
+static void nfs4_cb_free_slot(struct cb_process_state *cps)
 {
 }
 #endif /* CONFIG_NFS_V4_1 */
@@ -866,6 +855,7 @@ static __be32 nfs4_callback_compound(struct svc_rqst *rqstp, void *argp, void *r
 	struct cb_process_state cps = {
 		.drc_status = 0,
 		.clp = NULL,
+		.slotid = -1,
 	};
 	unsigned int nops = 0;
 
@@ -906,7 +896,7 @@ static __be32 nfs4_callback_compound(struct svc_rqst *rqstp, void *argp, void *r
 
 	*hdr_res.status = status;
 	*hdr_res.nops = htonl(nops);
-	nfs4_cb_free_slot(cps.clp);
+	nfs4_cb_free_slot(&cps);
 	nfs_put_client(cps.clp);
 	dprintk("%s: done, status = %u\n", __func__, ntohl(status));
 	return rpc_success;
-- 
1.7.6
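
The following is an illustrative userspace sketch, not part of the posted
patch. It models the two ideas in the commit message: the DRAINING check and
the slot assignment happen under one lock, and the per-compound slot id
doubles as the "do we need to free?" flag. The names `slot_table`,
`compound_state`, `model_cb_sequence` and `model_cb_free_slot` are
hypothetical stand-ins, a pthread mutex stands in for the kernel spinlock, and
sequence-id validation is left out.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define SLOT_NONE (-1)

/* Hypothetical stand-in for the back channel slot table. */
struct slot_table {
	pthread_mutex_t lock;
	int highest_used_slotid;	/* SLOT_NONE when the single slot is free */
	bool draining;			/* models NFS4_SESSION_DRAINING */
};

/* Hypothetical stand-in for struct cb_process_state. */
struct compound_state {
	struct slot_table *tbl;
	int slotid;			/* SLOT_NONE until a slot is taken */
};

/* Models CB_SEQUENCE: test 'draining' and take the slot under one lock. */
static bool model_cb_sequence(struct compound_state *cps,
			      struct slot_table *tbl, int slotid)
{
	bool taken = false;

	pthread_mutex_lock(&tbl->lock);
	if (!tbl->draining) {
		tbl->highest_used_slotid = slotid;
		taken = true;
	}
	pthread_mutex_unlock(&tbl->lock);

	cps->tbl = tbl;			/* set even on failure, like cps->clp */
	if (taken)
		cps->slotid = slotid;
	return taken;
}

/* Models the end of the compound: free only if a slot was really taken. */
static void model_cb_free_slot(struct compound_state *cps)
{
	if (cps->slotid == SLOT_NONE)
		return;
	pthread_mutex_lock(&cps->tbl->lock);
	cps->tbl->highest_used_slotid = SLOT_NONE;
	pthread_mutex_unlock(&cps->tbl->lock);
	cps->slotid = SLOT_NONE;
}
```

In this model a compound whose CB_SEQUENCE is refused while draining leaves
`slotid` at `SLOT_NONE`, so the later free is a no-op and the slot counter
cannot drift.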



* [PATCH v2 2/2] NFSv4.1: Return NFS4ERR_BADSESSION to callbacks during session resets
  2011-08-02 18:50                         ` [PATCH v2 1/2] NFSv4.1: Fix the callback 'highest_used_slotid' behaviour Trond Myklebust
@ 2011-08-02 18:50                           ` Trond Myklebust
  2011-08-03  8:52                           ` [PATCH v2 1/2] NFSv4.1: Fix the callback 'highest_used_slotid' behaviour Peng Tao
  1 sibling, 0 replies; 63+ messages in thread
From: Trond Myklebust @ 2011-08-02 18:50 UTC (permalink / raw)
  To: Andy Adamson
  Cc: Peng Tao, Jim Rees, Christoph Hellwig, linux-nfs, peter honeyman

If the client is in the process of resetting the session when it receives
a callback, then returning NFS4ERR_DELAY may cause a deadlock with the
DESTROY_SESSION call.

Basically, if the client returns NFS4ERR_DELAY in response to the
CB_SEQUENCE call, then the server is entitled to believe that the
client is busy because it is already processing that call. In that
case, the server is perfectly entitled to respond with a
NFS4ERR_BACK_CHAN_BUSY to any DESTROY_SESSION call.

Fix this by having the client reply with a NFS4ERR_BADSESSION in
response to the callback if it is resetting the session.

Cc: stable@kernel.org [2.6.38+]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
 fs/nfs/callback_proc.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/callback_proc.c b/fs/nfs/callback_proc.c
index 0ab8202..43926ad 100644
--- a/fs/nfs/callback_proc.c
+++ b/fs/nfs/callback_proc.c
@@ -452,6 +452,11 @@ __be32 nfs4_callback_sequence(struct cb_sequenceargs *args,
 	if (test_bit(NFS4_SESSION_DRAINING, &clp->cl_session->session_state)) {
 		spin_unlock(&tbl->slot_tbl_lock);
 		status = htonl(NFS4ERR_DELAY);
+		/* Return NFS4ERR_BADSESSION if we're draining the session
+		 * in order to reset it.
+		 */
+		if (test_bit(NFS4CLNT_SESSION_RESET, &clp->cl_state))
+			status = htonl(NFS4ERR_BADSESSION);
 		goto out;
 	}
 
-- 
1.7.6
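
As an illustrative sketch only (not part of the posted patch), the status
selection described above can be written as a small pure function. The
function name `cb_sequence_drain_status` and its boolean parameters are
hypothetical; the numeric error values are the ones assigned by the NFSv4.1
specification, and the kernel's `htonl()` conversion is omitted here.

```c
#include <stdbool.h>

/* Error values as assigned by the NFSv4.1 specification (RFC 5661). */
enum {
	NFS4_OK            = 0,
	NFS4ERR_DELAY      = 10008,
	NFS4ERR_BADSESSION = 10052,
};

/*
 * draining:      models test_bit(NFS4_SESSION_DRAINING, ...)
 * session_reset: models test_bit(NFS4CLNT_SESSION_RESET, ...)
 *
 * Returns the status to send in the CB_SEQUENCE reply, in host byte order.
 */
static int cb_sequence_drain_status(bool draining, bool session_reset)
{
	if (!draining)
		return NFS4_OK;		/* proceed to sequence validation */
	/* Draining in order to reset: tell the server the session is gone. */
	return session_reset ? NFS4ERR_BADSESSION : NFS4ERR_DELAY;
}
```

Only the draining-for-reset case changes; a drain for any other reason still
answers NFS4ERR_DELAY, as in the diff.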



* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-08-02 12:28                         ` Trond Myklebust
  2011-08-02 12:56                           ` Jim Rees
@ 2011-08-03  1:48                           ` Jim Rees
  2011-08-03  2:07                             ` Myklebust, Trond
  1 sibling, 1 reply; 63+ messages in thread
From: Jim Rees @ 2011-08-03  1:48 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Peng Tao, Adamson, Andy, Christoph Hellwig, linux-nfs, peter honeyman

Here's what the test is doing.  It does multiple parallel instances of this,
each one doing thousands of the following in a loop.  Console output with
mutex and lock debug is in
http://www.citi.umich.edu/projects/nfsv4/pnfs/block/download/console.txt

getpid()                                = 2431
open("t_mtab~2431", O_WRONLY|O_CREAT, 0) = 3
close(3)                                = 0
link("t_mtab~2431", "t_mtab~")          = 0
unlink("t_mtab~2431")                   = 0
open("t_mtab~", O_WRONLY)               = 3
fcntl(3, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = 0
close(3)                                = 0
brk(0)                                  = 0x19ab000
brk(0x19cc000)                          = 0x19cc000
brk(0)                                  = 0x19cc000
open("t_mtab", O_RDONLY)                = 3
open("t_mtab.tmp", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
read(3, "/proc on /proc type proc (rw,rel"..., 4096) = 2598
write(4, "/proc on /proc type proc (rw,rel"..., 2598) = 2598
read(3, "", 4096)                       = 0
close(3)                                = 0
fchmod(4, 0644)                         = 0
close(4)                                = 0
stat("t_mtab", {st_mode=S_IFREG|0644, st_size=2598, ...}) = 0
chown("t_mtab.tmp", 99, 0)              = -1 EINVAL (Invalid argument)
rename("t_mtab.tmp", "t_mtab")          = 0
unlink("t_mtab~")                       = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd95cd76000
write(1, "completed 1 iterations\n", 23) = 23
exit_group(0)                           = ?
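
For readers who want to replay the sequence outside the test suite, the trace
above roughly corresponds to the C sketch below. It is a reconstruction from
the strace output, not the test's actual source. The function name
`mtab_update_once` is hypothetical, the lock file is created with mode 0600
rather than the trace's 0 so that it also runs unprivileged, and the chown()
call (which fails with EINVAL in the trace) is omitted.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* One lock/copy/rename iteration in the current directory; 0 on success. */
static int mtab_update_once(void)
{
	struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
	char tmpname[64], buf[4096];
	ssize_t n;
	int fd, in, out, ret = -1;

	/* Create a per-process file and hard-link it to the lock name. */
	snprintf(tmpname, sizeof(tmpname), "t_mtab~%ld", (long)getpid());
	fd = open(tmpname, O_WRONLY | O_CREAT, 0600);
	if (fd < 0)
		return -1;
	close(fd);
	if (link(tmpname, "t_mtab~") < 0) {
		unlink(tmpname);
		return -1;	/* another process holds the lock file */
	}
	unlink(tmpname);

	/* Take and drop a write lock on the lock file, as the trace does. */
	fd = open("t_mtab~", O_WRONLY);
	if (fd < 0)
		goto unlock;
	if (fcntl(fd, F_SETLK, &fl) < 0) {
		close(fd);
		goto unlock;
	}
	close(fd);

	/* Copy t_mtab to t_mtab.tmp, then rename it over the original. */
	in = open("t_mtab", O_RDONLY);
	if (in < 0)
		goto unlock;
	out = open("t_mtab.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0666);
	if (out < 0) {
		close(in);
		goto unlock;
	}
	while ((n = read(in, buf, sizeof(buf))) > 0) {
		if (write(out, buf, (size_t)n) != n) {
			n = -1;
			break;
		}
	}
	close(in);
	if (n == 0 && fchmod(out, 0644) == 0)
		ret = 0;
	if (close(out) < 0)
		ret = -1;
	if (ret == 0 && rename("t_mtab.tmp", "t_mtab") < 0)
		ret = -1;
unlock:
	unlink("t_mtab~");	/* removing the lock file releases the lock */
	return ret;
}
```

Running many processes that call this in a loop on the same NFS directory
should give a similar mix of link, unlink and rename traffic.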


* RE: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-08-03  1:48                           ` Jim Rees
@ 2011-08-03  2:07                             ` Myklebust, Trond
       [not found]                               ` <2E1EB2CF9ED1CB4AA966F0EB76EAB4430A778AE2-hX7t0kiaRRrlMGe9HJ1VYQK/GNPrWCqfQQ4Iyu8u01E@public.gmane.org>
  2011-08-03  2:38                               ` Jim Rees
  0 siblings, 2 replies; 63+ messages in thread
From: Myklebust, Trond @ 2011-08-03  2:07 UTC (permalink / raw)
  To: Jim Rees
  Cc: Peng Tao, Adamson, Andy, Christoph Hellwig, linux-nfs, peter honeyman

> -----Original Message-----
> From: Jim Rees [mailto:rees@umich.edu]
> Sent: Tuesday, August 02, 2011 9:48 PM
> To: Myklebust, Trond
> Cc: Peng Tao; Adamson, Andy; Christoph Hellwig;
> linux-nfs@vger.kernel.org; peter honeyman
> Subject: Re: [PATCH v4 00/27] add block layout driver to pnfs client
> 
> Here's what the test is doing.  It does multiple parallel instances of this,
> each one doing thousands of the following in a loop.  Console output with
> mutex and lock debug is in
> http://www.citi.umich.edu/projects/nfsv4/pnfs/block/download/console.txt

Hmm... That trace appears to show that the contention is between
processes trying to grab the same inode->i_mutex (in ima_file_check()
and do_unlinkat()). The question is why is the unlink process hanging
for such a long time?

I suspect another callback issue is causing the unlink() to stall:
either our client failing to handle a server callback correctly, or
possibly the server failing to respond correctly to our reply.

Can you try to turn on the callback debugging ('echo 256 >
/proc/sys/sunrpc/nfs_debug')? A wireshark trace of what is going on
during the hang itself might also help.
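
For reference, a minimal sketch of setting the same mask from a C program. It
assumes that 256 corresponds to the kernel's NFSDBG_CALLBACK flag (0x0100);
the helper name `write_debug_mask` is hypothetical, and writing to the real
sysctl file requires root.

```c
#include <stdio.h>

/* Assumed value of NFSDBG_CALLBACK in the kernel's NFS debug flags. */
#define NFSDBG_CALLBACK 0x0100u

/* Write a debug mask in decimal to the given sysctl file; 0 on success. */
static int write_debug_mask(const char *path, unsigned int mask)
{
	FILE *f = fopen(path, "w");

	if (f == NULL)
		return -1;
	if (fprintf(f, "%u\n", mask) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}
```

Calling `write_debug_mask("/proc/sys/sunrpc/nfs_debug", NFSDBG_CALLBACK)`
should have the same effect as the echo command; writing 0 turns the debugging
off again.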

Cheers
  Trond


* Re: [PATCH v4 00/27] add block layout driver to pnfs client
       [not found]                               ` <2E1EB2CF9ED1CB4AA966F0EB76EAB4430A778AE2-hX7t0kiaRRrlMGe9HJ1VYQK/GNPrWCqfQQ4Iyu8u01E@public.gmane.org>
@ 2011-08-03  2:11                                 ` Jim Rees
  0 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-08-03  2:11 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: Peng Tao, Adamson, Andy, Christoph Hellwig, linux-nfs, peter honeyman

Myklebust, Trond wrote:

  Can you try to turn on the callback debugging ('echo 256 >
  /proc/sys/sunrpc/nfs_debug')? A wireshark trace of what is going on
  during the hang itself might also help.

Will do, tomorrow morning.


* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-08-03  2:07                             ` Myklebust, Trond
       [not found]                               ` <2E1EB2CF9ED1CB4AA966F0EB76EAB4430A778AE2-hX7t0kiaRRrlMGe9HJ1VYQK/GNPrWCqfQQ4Iyu8u01E@public.gmane.org>
@ 2011-08-03  2:38                               ` Jim Rees
  2011-08-03  8:43                                 ` Peng Tao
  1 sibling, 1 reply; 63+ messages in thread
From: Jim Rees @ 2011-08-03  2:38 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: Peng Tao, Adamson, Andy, Christoph Hellwig, linux-nfs, peter honeyman

Myklebust, Trond wrote:

  Can you try to turn on the callback debugging ('echo 256 >
  /proc/sys/sunrpc/nfs_debug')? A wireshark trace of what is going on
  during the hang itself might also help.

It doesn't fail with callback debugging turned on.  I suspect the output to
my slow serial console disrupts the timing.

The trace is in
http://www.citi.umich.edu/projects/nfsv4/pnfs/block/download/089.pcap


* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-08-03  2:38                               ` Jim Rees
@ 2011-08-03  8:43                                 ` Peng Tao
  2011-08-03 11:49                                   ` Jim Rees
  2011-08-03 11:53                                   ` Jim Rees
  0 siblings, 2 replies; 63+ messages in thread
From: Peng Tao @ 2011-08-03  8:43 UTC (permalink / raw)
  To: Jim Rees
  Cc: Myklebust, Trond, Adamson, Andy, Christoph Hellwig, linux-nfs,
	peter honeyman

On Wed, Aug 3, 2011 at 10:38 AM, Jim Rees <rees@umich.edu> wrote:
> Myklebust, Trond wrote:
>
>  Can you try to turn on the callback debugging ('echo 256 >
>  /proc/sys/sunrpc/nfs_debug')? A wireshark trace of what is going on
>  during the hang itself might also help.
>
> It doesn't fail with callback debugging turned on.  I suspect the output to
> my slow serial console disrupts the timing.
>
> The trace is in
> http://www.citi.umich.edu/projects/nfsv4/pnfs/block/download/089.pcap
In frame 8785, client returned NFS4ERR_DELAY in response to
CB_SEQUENCE and the next RPC was 12s later. So it should be the case
that Trond mentioned above.

-- 
Thanks,
Tao


* Re: [PATCH v2 1/2] NFSv4.1: Fix the callback 'highest_used_slotid' behaviour
  2011-08-02 18:50                         ` [PATCH v2 1/2] NFSv4.1: Fix the callback 'highest_used_slotid' behaviour Trond Myklebust
  2011-08-02 18:50                           ` [PATCH v2 2/2] NFSv4.1: Return NFS4ERR_BADSESSION to callbacks during session resets Trond Myklebust
@ 2011-08-03  8:52                           ` Peng Tao
  1 sibling, 0 replies; 63+ messages in thread
From: Peng Tao @ 2011-08-03  8:52 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Andy Adamson, Jim Rees, Christoph Hellwig, linux-nfs, peter honeyman

I applied the two patches and ran the test more than 10 times, with no hang so far.

Thanks,
Tao



* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-08-03  8:43                                 ` Peng Tao
@ 2011-08-03 11:49                                   ` Jim Rees
  2011-08-03 11:53                                   ` Jim Rees
  1 sibling, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-08-03 11:49 UTC (permalink / raw)
  To: Peng Tao
  Cc: Myklebust, Trond, Adamson, Andy, Christoph Hellwig, linux-nfs,
	peter honeyman

Peng Tao wrote:

  On Wed, Aug 3, 2011 at 10:38 AM, Jim Rees <rees@umich.edu> wrote:
  > Myklebust, Trond wrote:
  >
  >  Can you try to turn on the callback debugging ('echo 256 >
  >  /proc/sys/sunrpc/nfs_debug')? A wireshark trace of what is going on
  >  during the hang itself might also help.
  >
  > It doesn't fail with callback debugging turned on.  I suspect the output to
  > my slow serial console disrupts the timing.
  >
  > The trace is in
  > http://www.citi.umich.edu/projects/nfsv4/pnfs/block/download/089.pcap
  In frame 8785, client returned NFS4ERR_DELAY in response to
  CB_SEQUENCE and the next RPC was 12s later. So it should be the case
  that Trond mentioned above.

Yes, and there's some funny stuff earlier too.  In frame 6381 the client
returns a delegation, then in 6387 the server tries to recall the same
delegation.  Which I suppose is legal but I wonder why.  This is followed by
another 12 sec delay.


* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-08-03  8:43                                 ` Peng Tao
  2011-08-03 11:49                                   ` Jim Rees
@ 2011-08-03 11:53                                   ` Jim Rees
  2011-08-03 13:59                                     ` Peng Tao
  1 sibling, 1 reply; 63+ messages in thread
From: Jim Rees @ 2011-08-03 11:53 UTC (permalink / raw)
  To: Peng Tao
  Cc: Myklebust, Trond, Adamson, Andy, Christoph Hellwig, linux-nfs,
	peter honeyman

Peng Tao wrote:

  On Wed, Aug 3, 2011 at 10:38 AM, Jim Rees <rees@umich.edu> wrote:
  > Myklebust, Trond wrote:
  >
  >  Can you try to turn on the callback debugging ('echo 256 >
  >  /proc/sys/sunrpc/nfs_debug')? A wireshark trace of what is going on
  >  during the hang itself might also help.
  >
  > It doesn't fail with callback debugging turned on.  I suspect the output to
  > my slow serial console disrupts the timing.
  >
  > The trace is in
  > http://www.citi.umich.edu/projects/nfsv4/pnfs/block/download/089.pcap
  In frame 8785, client returned NFS4ERR_DELAY in response to
  CB_SEQUENCE and the next RPC was 12s later. So it should be the case
  that Trond mentioned above.

That trace was taken with Trond's two patches applied.  So they did not fix
the problem for me.


* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-08-03 11:53                                   ` Jim Rees
@ 2011-08-03 13:59                                     ` Peng Tao
  2011-08-03 14:11                                       ` Jim Rees
  0 siblings, 1 reply; 63+ messages in thread
From: Peng Tao @ 2011-08-03 13:59 UTC (permalink / raw)
  To: Jim Rees
  Cc: Myklebust, Trond, Adamson, Andy, Christoph Hellwig, linux-nfs,
	peter honeyman

On Wed, Aug 3, 2011 at 7:53 PM, Jim Rees <rees@umich.edu> wrote:
> Peng Tao wrote:
>
>  On Wed, Aug 3, 2011 at 10:38 AM, Jim Rees <rees@umich.edu> wrote:
>  > Myklebust, Trond wrote:
>  >
>  >  Can you try to turn on the callback debugging ('echo 256 >
>  >  /proc/sys/sunrpc/nfs_debug')? A wireshark trace of what is going on
>  >  during the hang itself might also help.
>  >
>  > It doesn't fail with callback debugging turned on.  I suspect the output to
>  > my slow serial console disrupts the timing.
>  >
>  > The trace is in
>  > http://www.citi.umich.edu/projects/nfsv4/pnfs/block/download/089.pcap
>  In frame 8785, client returned NFS4ERR_DELAY in response to
>  CB_SEQUENCE and the next RPC was 12s later. So it should be the case
>  that Trond mentioned above.
>
> That trace was taken with Trond's two patches applied.  So they did not fix
> the problem for me.
>
How do you reproduce it? I reproduced it by running ./check -nfs, letting
it run to case 089, and killing it after case 089 had run for some seconds.

-- 
Thanks,
-Bergwolf


* Re: [PATCH v4 00/27] add block layout driver to pnfs client
  2011-08-03 13:59                                     ` Peng Tao
@ 2011-08-03 14:11                                       ` Jim Rees
  0 siblings, 0 replies; 63+ messages in thread
From: Jim Rees @ 2011-08-03 14:11 UTC (permalink / raw)
  To: Peng Tao
  Cc: Myklebust, Trond, Adamson, Andy, Christoph Hellwig, linux-nfs,
	peter honeyman

Peng Tao wrote:

  How do you reproduce it? I reproduced it by running ./check -nfs, letting
  it run to case 089, and killing it after case 089 had run for some seconds.

./check -nfs 089

After less than a minute, I start getting hung tasks (I have hung task
timeout set to 10 sec).


Thread overview: 63+ messages
2011-07-28 17:30 [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
2011-07-28 17:30 ` [PATCH v4 01/27] pnfs: GETDEVICELIST Jim Rees
2011-07-28 17:30 ` [PATCH v4 02/27] pnfs: add set-clear layoutdriver interface Jim Rees
2011-07-28 17:30 ` [PATCH v4 03/27] pnfs: save layoutcommit lwb at layout header Jim Rees
2011-07-28 17:30 ` [PATCH v4 04/27] pnfs: save layoutcommit cred " Jim Rees
2011-07-28 17:30 ` [PATCH v4 05/27] pnfs: let layoutcommit handle a list of lseg Jim Rees
2011-07-28 18:52   ` Boaz Harrosh
2011-07-28 17:30 ` [PATCH v4 06/27] pnfs: use lwb as layoutcommit length Jim Rees
2011-07-28 17:30 ` [PATCH v4 07/27] NFS41: save layoutcommit cred in layout header init Jim Rees
2011-07-28 17:30 ` [PATCH v4 08/27] pnfs: ask for layout_blksize and save it in nfs_server Jim Rees
2011-07-28 17:30 ` [PATCH v4 09/27] pnfs: cleanup_layoutcommit Jim Rees
2011-07-28 18:26   ` Boaz Harrosh
2011-07-29  3:16     ` Jim Rees
2011-07-28 17:30 ` [PATCH v4 10/27] pnfsblock: add blocklayout Kconfig option, Makefile, and stubs Jim Rees
2011-07-28 17:31 ` [PATCH v4 11/27] pnfsblock: use pageio_ops api Jim Rees
2011-07-28 17:31 ` [PATCH v4 12/27] pnfsblock: basic extent code Jim Rees
2011-07-28 17:31 ` [PATCH v4 13/27] pnfsblock: add device operations Jim Rees
2011-07-28 17:31 ` [PATCH v4 14/27] pnfsblock: remove " Jim Rees
2011-07-28 17:31 ` [PATCH v4 15/27] pnfsblock: lseg alloc and free Jim Rees
2011-07-28 17:31 ` [PATCH v4 16/27] pnfsblock: merge extents Jim Rees
2011-07-28 17:31 ` [PATCH v4 17/27] pnfsblock: call and parse getdevicelist Jim Rees
2011-07-28 17:31 ` [PATCH v4 18/27] pnfsblock: xdr decode pnfs_block_layout4 Jim Rees
2011-07-28 17:31 ` [PATCH v4 19/27] pnfsblock: bl_find_get_extent Jim Rees
2011-07-28 17:31 ` [PATCH v4 20/27] pnfsblock: add extent manipulation functions Jim Rees
2011-07-28 17:31 ` [PATCH v4 21/27] pnfsblock: merge rw extents Jim Rees
2011-07-28 17:31 ` [PATCH v4 22/27] pnfsblock: encode_layoutcommit Jim Rees
2011-07-28 17:31 ` [PATCH v4 23/27] pnfsblock: cleanup_layoutcommit Jim Rees
2011-07-28 17:31 ` [PATCH v4 24/27] pnfsblock: bl_read_pagelist Jim Rees
2011-07-28 17:31 ` [PATCH v4 25/27] pnfsblock: bl_write_pagelist Jim Rees
2011-07-28 17:31 ` [PATCH v4 26/27] pnfsblock: note written INVAL areas for layoutcommit Jim Rees
2011-07-28 17:31 ` [PATCH v4 27/27] pnfsblock: write_pagelist handle zero invalid extents Jim Rees
2011-07-29 15:51 ` [PATCH v4 00/27] add block layout driver to pnfs client Christoph Hellwig
2011-07-29 17:45   ` Peng Tao
2011-07-29 18:44     ` Christoph Hellwig
2011-07-29 18:54   ` Jim Rees
2011-07-29 19:01     ` Christoph Hellwig
2011-07-29 19:13       ` Jim Rees
2011-07-30  1:09         ` Trond Myklebust
2011-07-30  3:26           ` Jim Rees
2011-07-30 14:25             ` Peng Tao
2011-08-01 21:10               ` Trond Myklebust
2011-08-01 22:35                 ` Trond Myklebust
2011-08-01 22:57                   ` Andy Adamson
2011-08-01 23:11                     ` Trond Myklebust
2011-08-02 17:30                       ` Trond Myklebust
2011-08-02 18:50                         ` [PATCH v2 1/2] NFSv4.1: Fix the callback 'highest_used_slotid' behaviour Trond Myklebust
2011-08-02 18:50                           ` [PATCH v2 2/2] NFSv4.1: Return NFS4ERR_BADSESSION to callbacks during session resets Trond Myklebust
2011-08-03  8:52                           ` [PATCH v2 1/2] NFSv4.1: Fix the callback 'highest_used_slotid' behaviour Peng Tao
2011-08-02  2:21                   ` [PATCH v4 00/27] add block layout driver to pnfs client Jim Rees
2011-08-02  2:29                     ` Myklebust, Trond
2011-08-02  3:23                       ` Jim Rees
2011-08-02 12:28                         ` Trond Myklebust
2011-08-02 12:56                           ` Jim Rees
2011-08-03  1:48                           ` Jim Rees
2011-08-03  2:07                             ` Myklebust, Trond
     [not found]                               ` <2E1EB2CF9ED1CB4AA966F0EB76EAB4430A778AE2-hX7t0kiaRRrlMGe9HJ1VYQK/GNPrWCqfQQ4Iyu8u01E@public.gmane.org>
2011-08-03  2:11                                 ` Jim Rees
2011-08-03  2:38                               ` Jim Rees
2011-08-03  8:43                                 ` Peng Tao
2011-08-03 11:49                                   ` Jim Rees
2011-08-03 11:53                                   ` Jim Rees
2011-08-03 13:59                                     ` Peng Tao
2011-08-03 14:11                                       ` Jim Rees
2011-07-30 14:18           ` Jim Rees
