* [PATCH v2 0/3] recoalesce when ld read/write fails
From: Peng Tao @ 2011-08-13  1:00 UTC (permalink / raw)
  To: benny; +Cc: bharrosh, linux-nfs

Hi,

I have moved the error handling inside mds_ops->rpc_release to reuse code
as suggested by Boaz.

I think we still need to issue the IO even for writes, because we don't know
whether the current writeback is the last one. If we just re-dirty the pages
and this turns out to be the last flush (the flush at file close), there is
no later flusher to write back the re-dirtied pages. Boaz, please take a look
and see if the current approach is OK. Thanks.

The two cleanup patches (pipe upcall and set_lo_fail) are separated out of
this patchset so they can be merged more easily.

Thanks,
Tao

Peng Tao (3):
  pNFS: recoalesce when ld write pagelist fails
  pNFS: recoalesce when ld read pagelist fails
  pNFS: introduce pnfs private workqueue

 fs/nfs/blocklayout/blocklayout.c |   17 +++++--
 fs/nfs/objlayout/objio_osd.c     |    8 +++
 fs/nfs/objlayout/objlayout.c     |    4 +-
 fs/nfs/pnfs.c                    |   92 +++++++++++++++++++++++++++-----------
 fs/nfs/pnfs.h                    |    8 +++-
 fs/nfs/read.c                    |   13 +++++-
 fs/nfs/write.c                   |   25 ++++++++++-
 7 files changed, 129 insertions(+), 38 deletions(-)


* [PATCH v2 1/3] pNFS: recoalesce when ld write pagelist fails
From: Peng Tao @ 2011-08-13  1:00 UTC (permalink / raw)
  To: benny; +Cc: bharrosh, linux-nfs, Peng Tao

For pnfs pagelist write failure, we need to pg_recoalesce and resend
IO to mds.

Signed-off-by: Peng Tao <peng_tao@emc.com>
---
 fs/nfs/pnfs.c  |   20 +++++++-------------
 fs/nfs/pnfs.h  |    2 +-
 fs/nfs/write.c |   25 ++++++++++++++++++++++++-
 3 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index e550e88..ad0579a 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1168,23 +1168,17 @@ EXPORT_SYMBOL_GPL(pnfs_generic_pg_test);
 /*
  * Called by non rpc-based layout drivers
  */
-int
-pnfs_ld_write_done(struct nfs_write_data *data)
+void pnfs_ld_write_done(struct nfs_write_data *data)
 {
-	int status;
-
-	if (!data->pnfs_error) {
+	if (likely(!data->pnfs_error)) {
 		pnfs_set_layoutcommit(data);
 		data->mds_ops->rpc_call_done(&data->task, data);
-		data->mds_ops->rpc_release(data);
-		return 0;
+	} else {
+		put_lseg(data->lseg);
+		data->lseg = NULL;
+		dprintk("pnfs write error = %d\n", data->pnfs_error);
 	}
-
-	dprintk("%s: pnfs_error=%d, retry via MDS\n", __func__,
-		data->pnfs_error);
-	status = nfs_initiate_write(data, NFS_CLIENT(data->inode),
-				    data->mds_ops, NFS_FILE_SYNC);
-	return status ? : -EAGAIN;
+	data->mds_ops->rpc_release(data);
 }
 EXPORT_SYMBOL_GPL(pnfs_ld_write_done);
 
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 01cbfd5..0fb0a41 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -200,7 +200,7 @@ void pnfs_set_layoutcommit(struct nfs_write_data *wdata);
 void pnfs_cleanup_layoutcommit(struct nfs4_layoutcommit_data *data);
 int pnfs_layoutcommit_inode(struct inode *inode, bool sync);
 int _pnfs_return_layout(struct inode *);
-int pnfs_ld_write_done(struct nfs_write_data *);
+void pnfs_ld_write_done(struct nfs_write_data *);
 int pnfs_ld_read_done(struct nfs_read_data *);
 struct pnfs_layout_segment *pnfs_update_layout(struct inode *ino,
 					       struct nfs_open_context *ctx,
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index b39b37f..26b8d90 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1165,7 +1165,13 @@ static void nfs_writeback_done_full(struct rpc_task *task, void *calldata)
 static void nfs_writeback_release_full(void *calldata)
 {
 	struct nfs_write_data	*data = calldata;
-	int status = data->task.tk_status;
+	int ret, status = data->task.tk_status;
+	struct nfs_pageio_descriptor pgio;
+
+	if (data->pnfs_error) {
+		nfs_pageio_init_write_mds(&pgio, data->inode, FLUSH_STABLE);
+		pgio.pg_recoalesce = 1;
+	}
 
 	/* Update attributes as result of writeback. */
 	while (!list_empty(&data->pages)) {
@@ -1181,6 +1187,11 @@ static void nfs_writeback_release_full(void *calldata)
 			req->wb_bytes,
 			(long long)req_offset(req));
 
+		if (data->pnfs_error) {
+			dprintk(", pnfs error = %d\n", data->pnfs_error);
+			goto next;
+		}
+
 		if (status < 0) {
 			nfs_set_pageerror(page);
 			nfs_context_set_write_error(req->wb_context, status);
@@ -1200,7 +1211,19 @@ remove_request:
 	next:
 		nfs_clear_page_tag_locked(req);
 		nfs_end_page_writeback(page);
+		if (data->pnfs_error) {
+			lock_page(page);
+			nfs_pageio_cond_complete(&pgio, page->index);
+			ret = nfs_page_async_flush(&pgio, page, 0);
+			if (ret) {
+				nfs_set_pageerror(page);
+				dprintk(", error = %d\n", ret);
+			}
+			unlock_page(page);
+		}
 	}
+	if (data->pnfs_error)
+		nfs_pageio_complete(&pgio);
 	nfs_writedata_release(calldata);
 }
 
-- 
1.7.1.262.g5ef3d


* [PATCH v2 2/3] pNFS: recoalesce when ld read pagelist fails
From: Peng Tao @ 2011-08-13  1:00 UTC (permalink / raw)
  To: benny; +Cc: bharrosh, linux-nfs, Peng Tao

For pnfs pagelist read failure, we need to pg_recoalesce and resend
IO to mds.

Signed-off-by: Peng Tao <peng_tao@emc.com>
---
 fs/nfs/pnfs.c |   20 +++++++-------------
 fs/nfs/pnfs.h |    2 +-
 fs/nfs/read.c |   13 ++++++++++++-
 3 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index ad0579a..84ecd41 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1262,23 +1262,17 @@ EXPORT_SYMBOL_GPL(pnfs_generic_pg_writepages);
 /*
  * Called by non rpc-based layout drivers
  */
-int
-pnfs_ld_read_done(struct nfs_read_data *data)
+void pnfs_ld_read_done(struct nfs_read_data *data)
 {
-	int status;
-
-	if (!data->pnfs_error) {
+	if (likely(!data->pnfs_error)) {
 		__nfs4_read_done_cb(data);
 		data->mds_ops->rpc_call_done(&data->task, data);
-		data->mds_ops->rpc_release(data);
-		return 0;
+	} else {
+		put_lseg(data->lseg);
+		data->lseg = NULL;
+		dprintk("pnfs read error = %d\n", data->pnfs_error);
 	}
-
-	dprintk("%s: pnfs_error=%d, retry via MDS\n", __func__,
-		data->pnfs_error);
-	status = nfs_initiate_read(data, NFS_CLIENT(data->inode),
-				   data->mds_ops);
-	return status ? : -EAGAIN;
+	data->mds_ops->rpc_release(data);
 }
 EXPORT_SYMBOL_GPL(pnfs_ld_read_done);
 
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 0fb0a41..506f67c 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -201,7 +201,7 @@ void pnfs_cleanup_layoutcommit(struct nfs4_layoutcommit_data *data);
 int pnfs_layoutcommit_inode(struct inode *inode, bool sync);
 int _pnfs_return_layout(struct inode *);
 void pnfs_ld_write_done(struct nfs_write_data *);
-int pnfs_ld_read_done(struct nfs_read_data *);
+void pnfs_ld_read_done(struct nfs_read_data *);
 struct pnfs_layout_segment *pnfs_update_layout(struct inode *ino,
 					       struct nfs_open_context *ctx,
 					       loff_t pos,
diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index 2171c04..2a46aaa 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -541,13 +541,24 @@ static void nfs_readpage_result_full(struct rpc_task *task, void *calldata)
 static void nfs_readpage_release_full(void *calldata)
 {
 	struct nfs_read_data *data = calldata;
+	struct nfs_pageio_descriptor pgio;
 
+	if (data->pnfs_error) {
+		nfs_pageio_init_read_mds(&pgio, data->inode);
+		pgio.pg_recoalesce = 1;
+	}
 	while (!list_empty(&data->pages)) {
 		struct nfs_page *req = nfs_list_entry(data->pages.next);
 
 		nfs_list_remove_request(req);
-		nfs_readpage_release(req);
+		if (!data->pnfs_error) {
+			nfs_readpage_release(req);
+		} else {
+			nfs_pageio_add_request(&pgio, req);
+		}
 	}
+	if (data->pnfs_error)
+		nfs_pageio_complete(&pgio);
 	nfs_readdata_release(calldata);
 }
 
-- 
1.7.1.262.g5ef3d


* [PATCH v2 3/3] pNFS: introduce pnfs private workqueue
From: Peng Tao @ 2011-08-13  1:00 UTC (permalink / raw)
  To: benny; +Cc: bharrosh, linux-nfs, Peng Tao

For layout driver io-done functions, the default workqueue is not a good
place, as the code is executed in the IO path and may be needed to make
forward progress under memory reclaim. So add a pnfs private workqueue
(allocated with WQ_MEM_RECLAIM) to handle them.

Also change the block and object layout code to make use of this private
workqueue.

Signed-off-by: Peng Tao <peng_tao@emc.com>
---
 fs/nfs/blocklayout/blocklayout.c |   17 ++++++++---
 fs/nfs/objlayout/objio_osd.c     |    8 ++++++
 fs/nfs/objlayout/objlayout.c     |    4 +-
 fs/nfs/pnfs.c                    |   52 +++++++++++++++++++++++++++++++++++++-
 fs/nfs/pnfs.h                    |    4 +++
 5 files changed, 77 insertions(+), 8 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 9561c8f..18dcab4 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -228,7 +228,7 @@ bl_end_par_io_read(void *data)
 	struct nfs_read_data *rdata = data;
 
 	INIT_WORK(&rdata->task.u.tk_work, bl_read_cleanup);
-	schedule_work(&rdata->task.u.tk_work);
+	pnfsiod_queue_work(&rdata->task.u.tk_work);
 }
 
 /* We don't want normal .rpc_call_done callback used, so we replace it
@@ -418,7 +418,7 @@ static void bl_end_par_io_write(void *data)
 	wdata->task.tk_status = 0;
 	wdata->verf.committed = NFS_FILE_SYNC;
 	INIT_WORK(&wdata->task.u.tk_work, bl_write_cleanup);
-	schedule_work(&wdata->task.u.tk_work);
+	pnfsiod_queue_work(&wdata->task.u.tk_work);
 }
 
 /* FIXME STUB - mark intersection of layout and page as bad, so is not
@@ -977,29 +977,35 @@ static int __init nfs4blocklayout_init(void)
 	if (ret)
 		goto out;
 
+	ret = pnfsiod_start();
+	if (ret)
+		goto out_remove;
+
 	init_waitqueue_head(&bl_wq);
 
 	mnt = rpc_get_mount();
 	if (IS_ERR(mnt)) {
 		ret = PTR_ERR(mnt);
-		goto out_remove;
+		goto out_stop;
 	}
 
 	ret = vfs_path_lookup(mnt->mnt_root,
 			      mnt,
 			      NFS_PIPE_DIRNAME, 0, &path);
 	if (ret)
-		goto out_remove;
+		goto out_stop;
 
 	bl_device_pipe = rpc_mkpipe(path.dentry, "blocklayout", NULL,
 				    &bl_upcall_ops, 0);
 	if (IS_ERR(bl_device_pipe)) {
 		ret = PTR_ERR(bl_device_pipe);
-		goto out_remove;
+		goto out_stop;
 	}
 out:
 	return ret;
 
+out_stop:
+	pnfsiod_stop();
 out_remove:
 	pnfs_unregister_layoutdriver(&blocklayout_type);
 	return ret;
@@ -1011,6 +1017,7 @@ static void __exit nfs4blocklayout_exit(void)
 	       __func__);
 
 	pnfs_unregister_layoutdriver(&blocklayout_type);
+	pnfsiod_stop();
 	rpc_unlink(bl_device_pipe);
 }
 
diff --git a/fs/nfs/objlayout/objio_osd.c b/fs/nfs/objlayout/objio_osd.c
index d0cda12..f28013f 100644
--- a/fs/nfs/objlayout/objio_osd.c
+++ b/fs/nfs/objlayout/objio_osd.c
@@ -1042,7 +1042,14 @@ static int __init
 objlayout_init(void)
 {
 	int ret = pnfs_register_layoutdriver(&objlayout_type);
+	if (ret)
+		goto out;
 
+	ret = pnfsiod_start();
+	if (ret)
+		pnfs_unregister_layoutdriver(&objlayout_type);
+
+out:
 	if (ret)
 		printk(KERN_INFO
 			"%s: Registering OSD pNFS Layout Driver failed: error=%d\n",
@@ -1057,6 +1064,7 @@ static void __exit
 objlayout_exit(void)
 {
 	pnfs_unregister_layoutdriver(&objlayout_type);
+	pnfsiod_stop();
 	printk(KERN_INFO "%s: Unregistered OSD pNFS Layout Driver\n",
 	       __func__);
 }
diff --git a/fs/nfs/objlayout/objlayout.c b/fs/nfs/objlayout/objlayout.c
index 1d06f8e..f7c6c21 100644
--- a/fs/nfs/objlayout/objlayout.c
+++ b/fs/nfs/objlayout/objlayout.c
@@ -305,7 +305,7 @@ objlayout_read_done(struct objlayout_io_state *state, ssize_t status, bool sync)
 		pnfs_ld_read_done(rdata);
 	else {
 		INIT_WORK(&rdata->task.u.tk_work, _rpc_read_complete);
-		schedule_work(&rdata->task.u.tk_work);
+		pnfsiod_queue_work(&rdata->task.u.tk_work);
 	}
 }
 
@@ -396,7 +396,7 @@ objlayout_write_done(struct objlayout_io_state *state, ssize_t status,
 		pnfs_ld_write_done(wdata);
 	else {
 		INIT_WORK(&wdata->task.u.tk_work, _rpc_write_complete);
-		schedule_work(&wdata->task.u.tk_work);
+		pnfsiod_queue_work(&wdata->task.u.tk_work);
 	}
 }
 
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 84ecd41..329befe 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -38,7 +38,7 @@
 /* Locking:
  *
  * pnfs_spinlock:
- *      protects pnfs_modules_tbl.
+ *      protects pnfs_modules_tbl, pnfsiod_workqueue and pnfsiod_users.
  */
 static DEFINE_SPINLOCK(pnfs_spinlock);
 
@@ -47,6 +47,9 @@ static DEFINE_SPINLOCK(pnfs_spinlock);
  */
 static LIST_HEAD(pnfs_modules_tbl);
 
+static struct workqueue_struct *pnfsiod_workqueue;
+static int pnfsiod_users;
+
 /* Return the registered pnfs layout driver module matching given id */
 static struct pnfs_layoutdriver_type *
 find_pnfs_driver_locked(u32 id)
@@ -1466,3 +1469,50 @@ out:
 	dprintk("<-- %s status %d\n", __func__, status);
 	return status;
 }
+
+/*
+ * start up the pnfsiod workqueue
+ */
+int pnfsiod_start(void)
+{
+	struct workqueue_struct *wq;
+	dprintk("RPC:       creating workqueue pnfsiod\n");
+	wq = alloc_workqueue("pnfsiod", WQ_MEM_RECLAIM, 0);
+	if (wq == NULL)
+		return -ENOMEM;
+	spin_lock(&pnfs_spinlock);
+	pnfsiod_users++;
+	if (pnfsiod_workqueue == NULL) {
+		pnfsiod_workqueue = wq;
+	} else {
+		destroy_workqueue(wq);
+	}
+	spin_unlock(&pnfs_spinlock);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(pnfsiod_start);
+
+/*
+ * Destroy the pnfsiod workqueue
+ */
+void pnfsiod_stop(void)
+{
+	struct workqueue_struct *wq = NULL;
+
+	spin_lock(&pnfs_spinlock);
+	pnfsiod_users--;
+	if (pnfsiod_users == 0) {
+		wq = pnfsiod_workqueue;
+		pnfsiod_workqueue = NULL;
+	}
+	spin_unlock(&pnfs_spinlock);
+	if (wq)
+		destroy_workqueue(wq);
+}
+EXPORT_SYMBOL_GPL(pnfsiod_stop);
+
+void pnfsiod_queue_work(struct work_struct *work)
+{
+	queue_work(pnfsiod_workqueue, work);
+}
+EXPORT_SYMBOL_GPL(pnfsiod_queue_work);
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 506f67c..3e159f2 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -165,6 +165,10 @@ extern int nfs4_proc_layoutget(struct nfs4_layoutget *lgp);
 extern int nfs4_proc_layoutreturn(struct nfs4_layoutreturn *lrp);
 
 /* pnfs.c */
+int pnfsiod_start(void);
+void pnfsiod_stop(void);
+void pnfsiod_queue_work(struct work_struct *work);
+
 void get_layout_hdr(struct pnfs_layout_hdr *lo);
 void put_lseg(struct pnfs_layout_segment *lseg);
 
-- 
1.7.1.262.g5ef3d


* Re: [PATCH v2 0/3] recoalesce when ld read/write fails
From: Boaz Harrosh @ 2011-08-16 20:43 UTC (permalink / raw)
  To: Peng Tao; +Cc: benny, linux-nfs

On 08/12/2011 06:00 PM, Peng Tao wrote:
> Hi,
> 
> I have moved the error handling inside mds_ops->rpc_release to reuse code
> as suggested by Boaz.
> 
> I think we still need to issue the IO even for writes, because we don't know
> whether the current writeback is the last one. If we just re-dirty the pages
> and this turns out to be the last flush (the flush at file close), there is
> no later flusher to write back the re-dirtied pages. Boaz, please take a look
> and see if the current approach is OK. Thanks.
>
> The two cleanup patches (pipe upcall and set_lo_fail) are separated out of
> this patchset so they can be merged more easily.
> 
> Thanks,
> Tao
> 

Thanks Tao

They look *really* good these patches. But as you said, do they actually work?

Did you test any of this? I mean do you have a facility to inject random
IO errors and test this out?

At the time I had the most dumb patch that would simply start failing IO after
n writes/reads. Another one would fail the IO if the offset was a magic offset
like 0x10017. Then with a simple dd I could make the IO fail and test this
out. (Which BTW never worked with pnfs.)
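
Off the top of my head the magic-offset one looked something like this (a
from-memory sketch, not the exact patch; the helper name and the
write_pagelist hookup are made up):

	/* test-only sketch: fail any IO that touches a magic offset, so a
	 * simple dd can trigger the pnfs_error path on demand */
	#define PNFS_FAIL_MAGIC_OFFSET	0x10017ULL

	static bool pnfs_io_should_fail(loff_t offset, size_t count)
	{
		return offset <= PNFS_FAIL_MAGIC_OFFSET &&
		       PNFS_FAIL_MAGIC_OFFSET < offset + (loff_t)count;
	}

	/* then in the layout driver's write_pagelist, before submitting:
	 *
	 *	if (pnfs_io_should_fail(offset, count)) {
	 *		wdata->pnfs_error = -EIO;
	 *		pnfs_ld_write_done(wdata);
	 *		return PNFS_ATTEMPTED;
	 *	}
	 */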

The Kernel might be more resilient than we think with regard to waiting for
clean pages. I know that in a UML it is very easy to get into a low-memory
condition where the iscsi stack starts to throw OOM messages and everything
stalls for a while, but I'm always surprised how eventually the Kernel picks
up and the IO ends successfully. One easy way to reproduce this is a
"git clone linux" on a machine that has a fast (iscsi) disk but only 256M of
memory.

I will only have time to test all this at the end of next week at the earliest.
So thanks again for looking into this.

Boaz

> Peng Tao (3):
>   pNFS: recoalesce when ld write pagelist fails
>   pNFS: recoalesce when ld read pagelist fails
>   pNFS: introduce pnfs private workqueue
> 
>  fs/nfs/blocklayout/blocklayout.c |   17 +++++--
>  fs/nfs/objlayout/objio_osd.c     |    8 +++
>  fs/nfs/objlayout/objlayout.c     |    4 +-
>  fs/nfs/pnfs.c                    |   92 +++++++++++++++++++++++++++-----------
>  fs/nfs/pnfs.h                    |    8 +++-
>  fs/nfs/read.c                    |   13 +++++-
>  fs/nfs/write.c                   |   25 ++++++++++-
>  7 files changed, 129 insertions(+), 38 deletions(-)
> 


* Re: [PATCH v2 0/3] recoalesce when ld read/write fails
From: Peng Tao @ 2011-08-17 13:42 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: benny, linux-nfs

Hi, Boaz,

On Wed, Aug 17, 2011 at 4:43 AM, Boaz Harrosh <bharrosh@panasas.com> wrote:
> On 08/12/2011 06:00 PM, Peng Tao wrote:
>> Hi,
>>
>> I have moved the error handling inside mds_ops->rpc_release to reuse code
>> as suggested by Boaz.
>>
>> I think we still need to issue the IO even for writes, because we don't know
>> whether the current writeback is the last one. If we just re-dirty the pages
>> and this turns out to be the last flush (the flush at file close), there is
>> no later flusher to write back the re-dirtied pages. Boaz, please take a look
>> and see if the current approach is OK. Thanks.
>>
>> The two cleanup patches (pipe upcall and set_lo_fail) are separated out of
>> this patchset so they can be merged more easily.
>>
>> Thanks,
>> Tao
>>
>
> Thanks Tao
>
> They look *really* good these patches. But as you said, do they actually work?
>
> Did you test any of this? I mean do you have a facility to inject random
> IO errors and test this out?
I have tested the patchset by forcing all bios to fail, so the IO is
re-sent to the MDS. I also tested read/write by forcing all IO that
touches pages beyond 16K to fail. Both tests passed.
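
In case it helps others reproduce this, the bio failure was forced roughly
like so in bl_submit_bio() (a sketch of the test hook; the FORCE_BIO_FAIL
guard is made up and of course not for merging):

	static struct bio *bl_submit_bio(int rw, struct bio *bio)
	{
		if (bio) {
			get_parallel(bio->bi_private);
	#ifdef FORCE_BIO_FAIL
			/* complete the bio with -EIO instead of hitting the
			 * disk, so every pagelist takes the pnfs_error path */
			bio_endio(bio, -EIO);
	#else
			submit_bio(rw, bio);
	#endif
		}
		return NULL;
	}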

>
> At the time I had the most dumb patch that would simply start failing IO after
> n writes/reads. Another one would fail the IO if the offset was a magic offset
> like 0x10017. Then with a simple dd I could make the IO fail and test this
> out. (Which BTW never worked with pnfs.)
>
> The Kernel might be more resilient than we think with regard to waiting for
> clean pages. I know that in a UML it is very easy to get into a low-memory
> condition where the iscsi stack starts to throw OOM messages and everything
> stalls for a while, but I'm always surprised how eventually the Kernel picks
> up and the IO ends successfully. One easy way to reproduce this is a
> "git clone linux" on a machine that has a fast (iscsi) disk but only 256M of
> memory.
I tried to test "git clone linux" on a 384M memory VM, but it hangs because
the server is rejecting layoutget with NFS4ERR_LAYOUTTRYLATER.
/proc/self/mountstats shows that the client has sent more than 14K
layoutgets but has never returned any layout. I checked the related code and
it seems the client should send layoutreturn for closed files. Does git keep
all files open? Or is there some case where the client won't return the
layout even after the file is closed?

This seems to be another issue though.

Thanks,
Tao

>
> I will only have time to test all this at the end of next week at the earliest.
> So thanks again for looking into this.
>
> Boaz
>
>> Peng Tao (3):
>>   pNFS: recoalesce when ld write pagelist fails
>>   pNFS: recoalesce when ld read pagelist fails
>>   pNFS: introduce pnfs private workqueue
>>
>>  fs/nfs/blocklayout/blocklayout.c |   17 +++++--
>>  fs/nfs/objlayout/objio_osd.c     |    8 +++
>>  fs/nfs/objlayout/objlayout.c     |    4 +-
>>  fs/nfs/pnfs.c                    |   92 +++++++++++++++++++++++++++-----------
>>  fs/nfs/pnfs.h                    |    8 +++-
>>  fs/nfs/read.c                    |   13 +++++-
>>  fs/nfs/write.c                   |   25 ++++++++++-
>>  7 files changed, 129 insertions(+), 38 deletions(-)
>>
>
>



-- 
Thanks,
-Bergwolf

* Re: [PATCH v2 0/3] recoalesce when ld read/write fails
From: Boaz Harrosh @ 2011-08-22 23:24 UTC (permalink / raw)
  To: Peng Tao; +Cc: benny, linux-nfs

On 08/17/2011 06:42 AM, Peng Tao wrote:
<snip>
>> They look *really* good these patches. But as you said, do they actually work?
>>
>> Did you test any of this? I mean do you have a facility to inject random
>> IO errors and test this out?
> I have tested the patchset by forcing all bios to fail, so the IO is
> re-sent to the MDS. I also tested read/write by forcing all IO that
> touches pages beyond 16K to fail. Both tests passed.
> 

So that sounds good. Why don't you like these patches?

>>
<snip>
> I tried to test "git clone linux" on a 384M memory VM, but it hangs because
> the server is rejecting layoutget with NFS4ERR_LAYOUTTRYLATER.
> /proc/self/mountstats shows that the client has sent more than 14K
> layoutgets but has never returned any layout. I checked the related code and
> it seems the client should send layoutreturn for closed files. Does git keep
> all files open? Or is there some case where the client won't return the
> layout even after the file is closed?
> 

Yes! Our client will never, ever return any layouts. That's the forgetful
model for you. It will only ever "layout_return" on evict_inode, which is way
after close, when the Kernel decides to clean up the inode cache, and that
only happens after a while.

If your layout driver sets the RETURN_ON_CLOSE bit on the layout,
then the pNFSD server code will simulate a layout_return on a file
close. This is what I use in EXOFS and it works very well. (I should
know because I wrote this patch for the pnfsd server.)

It looks like your block-based pNFSD filesystem needs layouts
returned. Set the res->lg_return_on_close = true; flag in your
.layout_get member and you should see layout_return on close.

Look at the fs/exofs/export.c file in Benny's pNFS tree to see how I
do it there.
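
In other words, something like this (a minimal sketch; only the
lg_return_on_close line is the point, the surrounding context is from
memory and may differ in Benny's tree):

	/* in the filesystem's .layout_get export op, before returning the
	 * layout to the client: */
	res->lg_return_on_close = true;	/* pNFSD simulates layout_return on close */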

> This seems to be another issue though.
> 

Yes, this is a different issue.

It looks from what you tested that the second approach works
well. I'll also test this out later on. Let's not rule these
out yet.

> Thanks,
> Tao

Sorry for the late response, I just came back from vacation.
It'll take me some time to catch up.

Thanks
Boaz

* Re: [PATCH v2 0/3] recoalesce when ld read/write fails
From: Peng Tao @ 2011-08-23 14:31 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: benny, linux-nfs

Hi, Boaz,

On Tue, Aug 23, 2011 at 7:24 AM, Boaz Harrosh <bharrosh@panasas.com> wrote:
> On 08/17/2011 06:42 AM, Peng Tao wrote:
> <snip>
>>> They look *really* good these patches. But as you said, do they actually work?
>>>
>>> Did you test any of this? I mean do you have a facility to inject random
>>> IO errors and test this out?
>> I have tested the patchset by forcing all bios to fail, so the IO is
>> re-sent to the MDS. I also tested read/write by forcing all IO that
>> touches pages beyond 16K to fail. Both tests passed.
>>
>
> So that sounds good. Why don't you like these patches?
Well, I do like them :)

>
>>>
> <snip>
>> I tried to test "git clone linux" on a 384M memory VM, but it hangs because
>> the server is rejecting layoutget with NFS4ERR_LAYOUTTRYLATER.
>> /proc/self/mountstats shows that the client has sent more than 14K
>> layoutgets but has never returned any layout. I checked the related code and
>> it seems the client should send layoutreturn for closed files. Does git keep
>> all files open? Or is there some case where the client won't return the
>> layout even after the file is closed?
>>
>
> Yes! Our client will never, ever return any layouts. That's the forgetful
> model for you. It will only ever "layout_return" on evict_inode, which is way
> after close, when the Kernel decides to clean up the inode cache, and that
> only happens after a while.
>
> If your layout driver sets the RETURN_ON_CLOSE bit on the layout,
> then the pNFSD server code will simulate a layout_return on a file
> close. This is what I use in EXOFS and it works very well. (I should
> know because I wrote this patch for the pnfsd server.)
>
> It looks like your block-based pNFSD filesystem needs layouts
> returned. Set the res->lg_return_on_close = true; flag in your
> .layout_get member and you should see layout_return on close.
>
> Look at the fs/exofs/export.c file in Benny's pNFS tree to see how I
> do it there.
Thanks for the explanation. Setting return_on_close means we may lose the
chance to reuse layout segments in reopen cases. I will look into it more
closely.

>
>> This seems to be another issue though.
>>
>
> Yes, this is a different issue.
>
> It looks from what you tested that the second approach works
> well. I'll also test this out later on. Let's not rule these
> out yet.
Thanks again for helping to test them.

Cheers,
Tao
>
>> Thanks,
>> Tao
>
> Sorry for the late response, I just came back from vacation.
> It'll take me some time to catch up.
>
> Thanks
> Boaz
>

* Re: [PATCH v2 1/3] pNFS: recoalesce when ld write pagelist fails
From: Peng Tao @ 2011-08-23 14:45 UTC (permalink / raw)
  To: benny, Boaz Harrosh; +Cc: linux-nfs, Peng Tao

Hi, Benny and Boaz,

On Sat, Aug 13, 2011 at 9:00 AM, Peng Tao <bergwolf@gmail.com> wrote:
> For pnfs pagelist write failure, we need to pg_recoalesce and resend
> IO to mds.
>
> Signed-off-by: Peng Tao <peng_tao@emc.com>
> ---
>  fs/nfs/pnfs.c  |   20 +++++++-------------
>  fs/nfs/pnfs.h  |    2 +-
>  fs/nfs/write.c |   25 ++++++++++++++++++++++++-
>  3 files changed, 32 insertions(+), 15 deletions(-)
>
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index e550e88..ad0579a 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -1168,23 +1168,17 @@ EXPORT_SYMBOL_GPL(pnfs_generic_pg_test);
>  /*
>  * Called by non rpc-based layout drivers
>  */
> -int
> -pnfs_ld_write_done(struct nfs_write_data *data)
> +void pnfs_ld_write_done(struct nfs_write_data *data)
>  {
> -       int status;
> -
> -       if (!data->pnfs_error) {
> +       if (likely(!data->pnfs_error)) {
>                pnfs_set_layoutcommit(data);
>                data->mds_ops->rpc_call_done(&data->task, data);
> -               data->mds_ops->rpc_release(data);
> -               return 0;
> +       } else {
> +               put_lseg(data->lseg);
> +               data->lseg = NULL;
> +               dprintk("pnfs write error = %d\n", data->pnfs_error);
>        }
> -
> -       dprintk("%s: pnfs_error=%d, retry via MDS\n", __func__,
> -               data->pnfs_error);
> -       status = nfs_initiate_write(data, NFS_CLIENT(data->inode),
> -                                   data->mds_ops, NFS_FILE_SYNC);
> -       return status ? : -EAGAIN;
> +       data->mds_ops->rpc_release(data);
>  }
>  EXPORT_SYMBOL_GPL(pnfs_ld_write_done);
>
> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> index 01cbfd5..0fb0a41 100644
> --- a/fs/nfs/pnfs.h
> +++ b/fs/nfs/pnfs.h
> @@ -200,7 +200,7 @@ void pnfs_set_layoutcommit(struct nfs_write_data *wdata);
>  void pnfs_cleanup_layoutcommit(struct nfs4_layoutcommit_data *data);
>  int pnfs_layoutcommit_inode(struct inode *inode, bool sync);
>  int _pnfs_return_layout(struct inode *);
> -int pnfs_ld_write_done(struct nfs_write_data *);
> +void pnfs_ld_write_done(struct nfs_write_data *);
>  int pnfs_ld_read_done(struct nfs_read_data *);
>  struct pnfs_layout_segment *pnfs_update_layout(struct inode *ino,
>                                               struct nfs_open_context *ctx,
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index b39b37f..26b8d90 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -1165,7 +1165,13 @@ static void nfs_writeback_done_full(struct rpc_task *task, void *calldata)
>  static void nfs_writeback_release_full(void *calldata)
>  {
>        struct nfs_write_data   *data = calldata;
> -       int status = data->task.tk_status;
> +       int ret, status = data->task.tk_status;
> +       struct nfs_pageio_descriptor pgio;
> +
> +       if (data->pnfs_error) {
> +               nfs_pageio_init_write_mds(&pgio, data->inode, FLUSH_STABLE);
> +               pgio.pg_recoalesce = 1;
> +       }
>
>        /* Update attributes as result of writeback. */
>        while (!list_empty(&data->pages)) {
> @@ -1181,6 +1187,11 @@ static void nfs_writeback_release_full(void *calldata)
>                        req->wb_bytes,
>                        (long long)req_offset(req));
>
> +               if (data->pnfs_error) {
> +                       dprintk(", pnfs error = %d\n", data->pnfs_error);
> +                       goto next;
> +               }
> +
>                if (status < 0) {
>                        nfs_set_pageerror(page);
>                        nfs_context_set_write_error(req->wb_context, status);
> @@ -1200,7 +1211,19 @@ remove_request:
>        next:
>                nfs_clear_page_tag_locked(req);
>                nfs_end_page_writeback(page);
> +               if (data->pnfs_error) {
> +                       lock_page(page);
> +                       nfs_pageio_cond_complete(&pgio, page->index);
> +                       ret = nfs_page_async_flush(&pgio, page, 0);
> +                       if (ret) {
> +                               nfs_set_pageerror(page);
My only concern about the patch is here. If we fail at
nfs_page_async_flush(), is it enough to just call nfs_set_pageerror()
on the page? Do we need to redirty the page as well, and maybe some
other things?

Thanks,
Tao
> +                               dprintk(", error = %d\n", ret);
> +                       }
> +                       unlock_page(page);
> +               }
>        }
> +       if (data->pnfs_error)
> +               nfs_pageio_complete(&pgio);
>        nfs_writedata_release(calldata);
>  }
>
> --
> 1.7.1.262.g5ef3d
>
>



-- 
Thanks,
-Bergwolf

* Re: [PATCH v2 1/3] pNFS: recoalesce when ld write pagelist fails
From: Boaz Harrosh @ 2011-08-27  2:43 UTC (permalink / raw)
  To: Peng Tao, Trond Myklebust; +Cc: benny, linux-nfs, Peng Tao

On 08/23/2011 07:45 AM, Peng Tao wrote:
> Hi, Benny and Boaz,
> 
> On Sat, Aug 13, 2011 at 9:00 AM, Peng Tao <bergwolf@gmail.com> wrote:
>> For pnfs pagelist write failure, we need to pg_recoalesce and resend
>> IO to mds.
>>
>> Signed-off-by: Peng Tao <peng_tao@emc.com>
>> ---
>>  fs/nfs/pnfs.c  |   20 +++++++-------------
>>  fs/nfs/pnfs.h  |    2 +-
>>  fs/nfs/write.c |   25 ++++++++++++++++++++++++-
>>  3 files changed, 32 insertions(+), 15 deletions(-)
>>
>> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>> index e550e88..ad0579a 100644
>> --- a/fs/nfs/pnfs.c
>> +++ b/fs/nfs/pnfs.c
>> @@ -1168,23 +1168,17 @@ EXPORT_SYMBOL_GPL(pnfs_generic_pg_test);
>>  /*
>>  * Called by non rpc-based layout drivers
>>  */
>> -int
>> -pnfs_ld_write_done(struct nfs_write_data *data)
>> +void pnfs_ld_write_done(struct nfs_write_data *data)
>>  {
>> -       int status;
>> -
>> -       if (!data->pnfs_error) {
>> +       if (likely(!data->pnfs_error)) {
>>                pnfs_set_layoutcommit(data);
>>                data->mds_ops->rpc_call_done(&data->task, data);
>> -               data->mds_ops->rpc_release(data);
>> -               return 0;
>> +       } else {
>> +               put_lseg(data->lseg);
>> +               data->lseg = NULL;
>> +               dprintk("pnfs write error = %d\n", data->pnfs_error);
>>        }
>> -
>> -       dprintk("%s: pnfs_error=%d, retry via MDS\n", __func__,
>> -               data->pnfs_error);
>> -       status = nfs_initiate_write(data, NFS_CLIENT(data->inode),
>> -                                   data->mds_ops, NFS_FILE_SYNC);
>> -       return status ? : -EAGAIN;
>> +       data->mds_ops->rpc_release(data);
>>  }
>>  EXPORT_SYMBOL_GPL(pnfs_ld_write_done);
>>
>> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
>> index 01cbfd5..0fb0a41 100644
>> --- a/fs/nfs/pnfs.h
>> +++ b/fs/nfs/pnfs.h
>> @@ -200,7 +200,7 @@ void pnfs_set_layoutcommit(struct nfs_write_data *wdata);
>>  void pnfs_cleanup_layoutcommit(struct nfs4_layoutcommit_data *data);
>>  int pnfs_layoutcommit_inode(struct inode *inode, bool sync);
>>  int _pnfs_return_layout(struct inode *);
>> -int pnfs_ld_write_done(struct nfs_write_data *);
>> +void pnfs_ld_write_done(struct nfs_write_data *);
>>  int pnfs_ld_read_done(struct nfs_read_data *);
>>  struct pnfs_layout_segment *pnfs_update_layout(struct inode *ino,
>>                                               struct nfs_open_context *ctx,
>> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
>> index b39b37f..26b8d90 100644
>> --- a/fs/nfs/write.c
>> +++ b/fs/nfs/write.c
>> @@ -1165,7 +1165,13 @@ static void nfs_writeback_done_full(struct rpc_task *task, void *calldata)
>>  static void nfs_writeback_release_full(void *calldata)
>>  {
>>        struct nfs_write_data   *data = calldata;
>> -       int status = data->task.tk_status;
>> +       int ret, status = data->task.tk_status;
>> +       struct nfs_pageio_descriptor pgio;
>> +
>> +       if (data->pnfs_error) {
>> +               nfs_pageio_init_write_mds(&pgio, data->inode, FLUSH_STABLE);
>> +               pgio.pg_recoalesce = 1;
>> +       }
>>
>>        /* Update attributes as result of writeback. */
>>        while (!list_empty(&data->pages)) {
>> @@ -1181,6 +1187,11 @@ static void nfs_writeback_release_full(void *calldata)
>>                        req->wb_bytes,
>>                        (long long)req_offset(req));
>>
>> +               if (data->pnfs_error) {
>> +                       dprintk(", pnfs error = %d\n", data->pnfs_error);
>> +                       goto next;
>> +               }
>> +
>>                if (status < 0) {
>>                        nfs_set_pageerror(page);
>>                        nfs_context_set_write_error(req->wb_context, status);
>> @@ -1200,7 +1211,19 @@ remove_request:
>>        next:
>>                nfs_clear_page_tag_locked(req);
>>                nfs_end_page_writeback(page);
>> +               if (data->pnfs_error) {
>> +                       lock_page(page);
>> +                       nfs_pageio_cond_complete(&pgio, page->index);
>> +                       ret = nfs_page_async_flush(&pgio, page, 0);
>> +                       if (ret) {
>> +                               nfs_set_pageerror(page);
> My only concern about the patch is here. If we fail at
> nfs_page_async_flush(), is it enough to just call nfs_set_pageerror()
> on the page? Do we need to redirty the page as well, and maybe some
> other things?
> 

Hum? That's a very good question. nfs_end_page_writeback above does
an end_page_writeback(), which does test_clear_page_writeback().
Now in nfs_page_async_flush() you are adding the page back to the nfs
writeback queue. So the question is: when is test_set_page_writeback()
called? If it is called as part of the re-issue of the write by NFS, then you
are fine, because you clear it here but it will be set again later. But if
it was done on entry into the queue, at the original site of the call to
nfs_page_async_flush(), then it will not be set again and it might confuse
the VFS, since we will be re-clearing an already-cleared bit.

So in short, we should investigate whether nfs_clear_page_tag_locked() and
nfs_end_page_writeback() should only be called in the *else* clause
of the "if (unlikely(data->pnfs_error))" above.

I'll be testing this next week and we'll see.

Trond do you know?

Boaz


> Thanks,
> Tao
>> +                               dprintk(", error = %d\n", ret);
>> +                       }
>> +                       unlock_page(page);
>> +               }
>>        }
>> +       if (data->pnfs_error)
>> +               nfs_pageio_complete(&pgio);
>>        nfs_writedata_release(calldata);
>>  }
>>
>> --
>> 1.7.1.262.g5ef3d
>>
>>
> 
> 
> 


* Re: [PATCH v2 1/3] pNFS: recoalesce when ld write pagelist fails
From: Peng Tao @ 2011-08-27 10:47 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: Trond Myklebust, benny, linux-nfs, Peng Tao

Hi, Boaz,

On Sat, Aug 27, 2011 at 10:43 AM, Boaz Harrosh <bharrosh@panasas.com> wrote:
> On 08/23/2011 07:45 AM, Peng Tao wrote:
>> Hi, Benny and Boaz,
>>
>> On Sat, Aug 13, 2011 at 9:00 AM, Peng Tao <bergwolf@gmail.com> wrote:
>>> For pnfs pagelist write failure, we need to pg_recoalesce and resend
>>> IO to mds.
>>>
>>> Signed-off-by: Peng Tao <peng_tao@emc.com>
>>> ---
>>>  fs/nfs/pnfs.c  |   20 +++++++-------------
>>>  fs/nfs/pnfs.h  |    2 +-
>>>  fs/nfs/write.c |   25 ++++++++++++++++++++++++-
>>>  3 files changed, 32 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>>> index e550e88..ad0579a 100644
>>> --- a/fs/nfs/pnfs.c
>>> +++ b/fs/nfs/pnfs.c
>>> @@ -1168,23 +1168,17 @@ EXPORT_SYMBOL_GPL(pnfs_generic_pg_test);
>>>  /*
>>>  * Called by non rpc-based layout drivers
>>>  */
>>> -int
>>> -pnfs_ld_write_done(struct nfs_write_data *data)
>>> +void pnfs_ld_write_done(struct nfs_write_data *data)
>>>  {
>>> -       int status;
>>> -
>>> -       if (!data->pnfs_error) {
>>> +       if (likely(!data->pnfs_error)) {
>>>                pnfs_set_layoutcommit(data);
>>>                data->mds_ops->rpc_call_done(&data->task, data);
>>> -               data->mds_ops->rpc_release(data);
>>> -               return 0;
>>> +       } else {
>>> +               put_lseg(data->lseg);
>>> +               data->lseg = NULL;
>>> +               dprintk("pnfs write error = %d\n", data->pnfs_error);
>>>        }
>>> -
>>> -       dprintk("%s: pnfs_error=%d, retry via MDS\n", __func__,
>>> -               data->pnfs_error);
>>> -       status = nfs_initiate_write(data, NFS_CLIENT(data->inode),
>>> -                                   data->mds_ops, NFS_FILE_SYNC);
>>> -       return status ? : -EAGAIN;
>>> +       data->mds_ops->rpc_release(data);
>>>  }
>>>  EXPORT_SYMBOL_GPL(pnfs_ld_write_done);
>>>
>>> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
>>> index 01cbfd5..0fb0a41 100644
>>> --- a/fs/nfs/pnfs.h
>>> +++ b/fs/nfs/pnfs.h
>>> @@ -200,7 +200,7 @@ void pnfs_set_layoutcommit(struct nfs_write_data *wdata);
>>>  void pnfs_cleanup_layoutcommit(struct nfs4_layoutcommit_data *data);
>>>  int pnfs_layoutcommit_inode(struct inode *inode, bool sync);
>>>  int _pnfs_return_layout(struct inode *);
>>> -int pnfs_ld_write_done(struct nfs_write_data *);
>>> +void pnfs_ld_write_done(struct nfs_write_data *);
>>>  int pnfs_ld_read_done(struct nfs_read_data *);
>>>  struct pnfs_layout_segment *pnfs_update_layout(struct inode *ino,
>>>                                               struct nfs_open_context *ctx,
>>> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
>>> index b39b37f..26b8d90 100644
>>> --- a/fs/nfs/write.c
>>> +++ b/fs/nfs/write.c
>>> @@ -1165,7 +1165,13 @@ static void nfs_writeback_done_full(struct rpc_task *task, void *calldata)
>>>  static void nfs_writeback_release_full(void *calldata)
>>>  {
>>>        struct nfs_write_data   *data = calldata;
>>> -       int status = data->task.tk_status;
>>> +       int ret, status = data->task.tk_status;
>>> +       struct nfs_pageio_descriptor pgio;
>>> +
>>> +       if (data->pnfs_error) {
>>> +               nfs_pageio_init_write_mds(&pgio, data->inode, FLUSH_STABLE);
>>> +               pgio.pg_recoalesce = 1;
>>> +       }
>>>
>>>        /* Update attributes as result of writeback. */
>>>        while (!list_empty(&data->pages)) {
>>> @@ -1181,6 +1187,11 @@ static void nfs_writeback_release_full(void *calldata)
>>>                        req->wb_bytes,
>>>                        (long long)req_offset(req));
>>>
>>> +               if (data->pnfs_error) {
>>> +                       dprintk(", pnfs error = %d\n", data->pnfs_error);
>>> +                       goto next;
>>> +               }
>>> +
>>>                if (status < 0) {
>>>                        nfs_set_pageerror(page);
>>>                        nfs_context_set_write_error(req->wb_context, status);
>>> @@ -1200,7 +1211,19 @@ remove_request:
>>>        next:
>>>                nfs_clear_page_tag_locked(req);
>>>                nfs_end_page_writeback(page);
>>> +               if (data->pnfs_error) {
>>> +                       lock_page(page);
>>> +                       nfs_pageio_cond_complete(&pgio, page->index);
>>> +                       ret = nfs_page_async_flush(&pgio, page, 0);
>>> +                       if (ret) {
>>> +                               nfs_set_pageerror(page);
>> My only concern about the patch is here. If we fail at
>> nfs_page_async_flush(), is it enough to just call nfs_set_pageerror()
>> on the page? Do we need to redirty the page as well, and maybe some
>> other things?
>>
>
> Hum? That's a very good question. nfs_end_page_writeback above does
> an end_page_writeback(), which does test_clear_page_writeback().
> Now in nfs_page_async_flush() you are adding the page back to the nfs
> writeback queue. So the question is: when is test_set_page_writeback()
> called? If it is called as part of the re-issue of the write by NFS, then you
> are fine, because you clear it here but it will be set again later. But if
> it was done on entry into the queue, at the original site of the call to
> nfs_page_async_flush(), then it will not be set again and it might confuse
> the VFS, since we will be re-clearing an already-cleared bit.
nfs_set_page_tag_locked() and nfs_set_page_writeback() are called
inside nfs_page_async_flush(), so we need to call
nfs_clear_page_tag_locked() and nfs_end_page_writeback() before
invoking nfs_page_async_flush(). But I am not quite sure about how to
handle nfs_page_async_flush() failure properly here (e.g., OOM). If we
don't re-dirty the page, will it break NFS writeback assumptions?
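
To spell out the ordering as I read fs/nfs/write.c (a simplified sketch from
memory, so take it as an assumption to verify):

	/*
	 * nfs_page_async_flush(pgio, page, nonblock)
	 *   req = nfs_find_and_lock_request(page, nonblock); // re-takes the tag lock
	 *   nfs_set_page_writeback(page);                    // re-sets PG_writeback
	 *   nfs_pageio_add_request(pgio, req);
	 *
	 * Both the tag lock and PG_writeback are re-taken inside
	 * nfs_page_async_flush(), which is why they must be dropped first.
	 */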

Thanks,
Tao
>
> So in short, we should investigate whether nfs_clear_page_tag_locked() and
> nfs_end_page_writeback() should only be called in the *else* clause
> of the "if (unlikely(data->pnfs_error))" above.
>
> I'll be testing this next week and we'll see.
>
> Trond do you know?
>
> Boaz
>
>
>> Thanks,
>> Tao
>>> +                               dprintk(", error = %d\n", ret);
>>> +                       }
>>> +                       unlock_page(page);
>>> +               }
>>>        }
>>> +       if (data->pnfs_error)
>>> +               nfs_pageio_complete(&pgio);
>>>        nfs_writedata_release(calldata);
>>>  }
>>>
>>> --
>>> 1.7.1.262.g5ef3d
>>>
>>>
>>
>>
>>
>
>

* Re: [PATCH v2 1/3] pNFS: recoalesce when ld write pagelist fails
From: Boaz Harrosh @ 2011-08-29 21:16 UTC (permalink / raw)
  To: Peng Tao; +Cc: Trond Myklebust, benny, linux-nfs, Peng Tao

On 08/27/2011 03:47 AM, Peng Tao wrote:
> Hi, Boaz,
> 
> On Sat, Aug 27, 2011 at 10:43 AM, Boaz Harrosh <bharrosh@panasas.com> wrote:
>>> My only concern about the patch is here. If we fail at
>>> nfs_page_async_flush(), is it enough to just call nfs_set_pageerror()
>>> on the page? Do we need to redirty the page as well, and maybe some
>>> other things?
>>>
>>
>> Hum? That's a very good question. nfs_end_page_writeback above does
>> an end_page_writeback(), which does test_clear_page_writeback().
>> Now in nfs_page_async_flush() you are adding the page back to the nfs
>> writeback queue. So the question is: when is test_set_page_writeback()
>> called? If it is called as part of the re-issue of the write by NFS, then you
>> are fine, because you clear it here but it will be set again later. But if
>> it was done on entry into the queue, at the original site of the call to
>> nfs_page_async_flush(), then it will not be set again and it might confuse
>> the VFS, since we will be re-clearing an already-cleared bit.
>
> nfs_set_page_tag_locked() and nfs_set_page_writeback() are called
> inside nfs_page_async_flush(), so we need to call
> nfs_clear_page_tag_locked() and nfs_end_page_writeback() before
> invoking nfs_page_async_flush(). 

OK, great, so you answered my question; that's good then.

> But I am not quite sure about how to
> handle nfs_page_async_flush() failure properly here (e.g., OOM). If we
> don't re-dirty the page, will it break NFS writeback assumptions?
> 

Hmm, that one is tough. (From the frying-pan into the fire.)

Let's add a big FIXME for now. Later, after lots of testing and error
injection, we might just check the nfs_page_async_flush() return code and
manually re-dirty the page on failure.
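
Something along these lines on top of your patch (an untested sketch; whether
__set_page_dirty_nobuffers() is the right re-dirty helper here is an
assumption to check):

	ret = nfs_page_async_flush(&pgio, page, 0);
	if (ret) {
		/* FIXME: untested error path */
		nfs_set_pageerror(page);
		/* put the page back on the dirty list so a later flusher
		 * retries it, instead of only flagging an error */
		__set_page_dirty_nobuffers(page);
		dprintk(", error = %d\n", ret);
	}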

> Thanks,
> Tao
>>

Thanks
Boaz
