* [PATCH 1/3] NFSv4.1 do not call LAYOUTRETURN when there are no legs
@ 2012-06-01 17:19 andros
  2012-06-01 17:19 ` [PATCH 2/3] NFSv4.1 mark layout when already returned andros
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: andros @ 2012-06-01 17:19 UTC (permalink / raw)
  To: trond.myklebust; +Cc: linux-nfs, Andy Adamson

From: Andy Adamson <andros@netapp.com>

mark_matching_lsegs_invalid() does not indicate if the plh_segs list is
empty on entrance - and it can be empty on exit.

When the plh_segs list is empty, mark_matching_lsegs_invalid removes
the pnfs_layout_hdr creation reference.

Signed-off-by: Andy Adamson <andros@netapp.com>
---
 fs/nfs/pnfs.c |   17 +++++++++++++++--
 1 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index b8323aa..854df5e 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -646,7 +646,14 @@ out_err_free:
 	return NULL;
 }
 
-/* Initiates a LAYOUTRETURN(FILE) */
+/*
+ * Initiates a LAYOUTRETURN(FILE), and removes the pnfs_layout_hdr
+ * when the layout segment list is empty.
+ *
+ * Note that a pnfs_layout_hdr can exist with an empty layout segment
+ * list when LAYOUTGET has failed, or when LAYOUTGET succeeded, but the
+ * deviceid is marked invalid.
+ */
 int
 _pnfs_return_layout(struct inode *ino)
 {
@@ -655,7 +662,7 @@ _pnfs_return_layout(struct inode *ino)
 	LIST_HEAD(tmp_list);
 	struct nfs4_layoutreturn *lrp;
 	nfs4_stateid stateid;
-	int status = 0;
+	int status = 0, empty = 0;
 
 	dprintk("--> %s\n", __func__);
 
@@ -669,11 +676,17 @@ _pnfs_return_layout(struct inode *ino)
 	stateid = nfsi->layout->plh_stateid;
 	/* Reference matched in nfs4_layoutreturn_release */
 	get_layout_hdr(lo);
+	empty = list_empty(&lo->plh_segs);
 	mark_matching_lsegs_invalid(lo, &tmp_list, NULL);
 	lo->plh_block_lgets++;
 	spin_unlock(&ino->i_lock);
 	pnfs_free_lseg_list(&tmp_list);
 
+	if (empty) {
+		put_layout_hdr(lo);
+		dprintk("%s: no layout segments to return\n", __func__);
+		return status;
+	}
 	WARN_ON(test_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags));
 
 	lrp = kzalloc(sizeof(*lrp), GFP_KERNEL);
-- 
1.7.6.4



* [PATCH 2/3] NFSv4.1 mark layout when already returned
  2012-06-01 17:19 [PATCH 1/3] NFSv4.1 do not call LAYOUTRETURN when there are no legs andros
@ 2012-06-01 17:19 ` andros
  2012-06-02 22:51   ` Boaz Harrosh
  2012-06-01 17:19 ` [PATCH 3/3] NFSv4.1 fence all layouts with file layout data server connection errors andros
  2012-06-02 23:00 ` [PATCH 1/3] NFSv4.1 do not call LAYOUTRETURN when there are no legs Boaz Harrosh
  2 siblings, 1 reply; 26+ messages in thread
From: andros @ 2012-06-01 17:19 UTC (permalink / raw)
  To: trond.myklebust; +Cc: linux-nfs, Andy Adamson

From: Andy Adamson <andros@netapp.com>

When fencing, pnfs_return_layout can be called multiple times with in-flight
i/o referencing lsegs on its plh_segs list.

Remember that LAYOUTRETURN has been called, and do not call it again.

Signed-off-by: Andy Adamson <andros@netapp.com>
---
 fs/nfs/pnfs.c |    6 ++++--
 fs/nfs/pnfs.h |   13 +++++++++++++
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 854df5e..9ffd5f8 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -664,11 +664,11 @@ _pnfs_return_layout(struct inode *ino)
 	nfs4_stateid stateid;
 	int status = 0, empty = 0;
 
-	dprintk("--> %s\n", __func__);
+	dprintk("--> %s for inode %lu\n", __func__, ino->i_ino);
 
 	spin_lock(&ino->i_lock);
 	lo = nfsi->layout;
-	if (!lo) {
+	if (!lo || pnfs_test_layout_returned(lo)) {
 		spin_unlock(&ino->i_lock);
 		dprintk("%s: no layout to return\n", __func__);
 		return status;
@@ -705,6 +705,8 @@ _pnfs_return_layout(struct inode *ino)
 	lrp->clp = NFS_SERVER(ino)->nfs_client;
 
 	status = nfs4_proc_layoutreturn(lrp);
+	if (status == 0)
+		pnfs_mark_layout_returned(lo);
 out:
 	dprintk("<-- %s status: %d\n", __func__, status);
 	return status;
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 29fd23c..8be6de2 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -64,6 +64,7 @@ enum {
 	NFS_LAYOUT_ROC,			/* some lseg had roc bit set */
 	NFS_LAYOUT_DESTROYED,		/* no new use of layout allowed */
 	NFS_LAYOUT_INVALID,		/* layout is being destroyed */
+	NFS_LAYOUT_RETURNED,		/* layout has already been returned */
 };
 
 enum layoutdriver_policy_flags {
@@ -255,6 +256,18 @@ struct nfs4_deviceid_node *nfs4_insert_deviceid_node(struct nfs4_deviceid_node *
 bool nfs4_put_deviceid_node(struct nfs4_deviceid_node *);
 void nfs4_deviceid_purge_client(const struct nfs_client *);
 
+static inline void
+pnfs_mark_layout_returned(struct pnfs_layout_hdr *lo)
+{
+	set_bit(NFS_LAYOUT_RETURNED, &lo->plh_flags);
+}
+
+static inline bool
+pnfs_test_layout_returned(struct pnfs_layout_hdr *lo)
+{
+	return test_bit(NFS_LAYOUT_RETURNED, &lo->plh_flags);
+}
+
 static inline int lo_fail_bit(u32 iomode)
 {
 	return iomode == IOMODE_RW ?
-- 
1.7.6.4



* [PATCH 3/3] NFSv4.1 fence all layouts with file layout data server connection errors
  2012-06-01 17:19 [PATCH 1/3] NFSv4.1 do not call LAYOUTRETURN when there are no legs andros
  2012-06-01 17:19 ` [PATCH 2/3] NFSv4.1 mark layout when already returned andros
@ 2012-06-01 17:19 ` andros
  2012-06-02 22:33   ` Boaz Harrosh
  2012-06-02 23:00 ` [PATCH 1/3] NFSv4.1 do not call LAYOUTRETURN when there are no legs Boaz Harrosh
  2 siblings, 1 reply; 26+ messages in thread
From: andros @ 2012-06-01 17:19 UTC (permalink / raw)
  To: trond.myklebust; +Cc: linux-nfs, Andy Adamson

From: Andy Adamson <andros@netapp.com>

Clear the inode LAYOUTCOMMIT flag as we will resend writes through the MDS.

Signed-off-by: Andy Adamson <andros@netapp.com>
---
 fs/nfs/nfs4filelayout.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
index e134029..96a75d1 100644
--- a/fs/nfs/nfs4filelayout.c
+++ b/fs/nfs/nfs4filelayout.c
@@ -205,8 +205,8 @@ static int filelayout_async_handle_error(struct rpc_task *task,
 	case -EPIPE:
 		dprintk("%s DS connection error %d\n", __func__,
 			task->tk_status);
-		if (!filelayout_test_devid_invalid(devid))
-			_pnfs_return_layout(inode);
+		clear_bit(NFS_INO_LAYOUTCOMMIT, &NFS_I(inode)->flags);
+		_pnfs_return_layout(inode);
 		filelayout_mark_devid_invalid(devid);
 		rpc_wake_up(&tbl->slot_tbl_waitq);
 		nfs4_ds_disconnect(clp);
-- 
1.7.6.4



* Re: [PATCH 3/3] NFSv4.1 fence all layouts with file layout data server connection errors
  2012-06-01 17:19 ` [PATCH 3/3] NFSv4.1 fence all layouts with file layout data server connection errors andros
@ 2012-06-02 22:33   ` Boaz Harrosh
  2012-06-04 13:49     ` Adamson, Andy
  0 siblings, 1 reply; 26+ messages in thread
From: Boaz Harrosh @ 2012-06-02 22:33 UTC (permalink / raw)
  To: andros; +Cc: trond.myklebust, linux-nfs

On 06/01/2012 08:19 PM, andros@netapp.com wrote:

> From: Andy Adamson <andros@netapp.com>
> 


The title does not match the message text or the patch at all.

More like: Don't send LAYOUTCOMMIT if data is rewritten through MDS 

I think.
Boaz

> Clear the inode LAYOUTCOMMIT flag as we will resend writes through the MDS.
> 
> Signed-off-by: Andy Adamson <andros@netapp.com>
> ---
>  fs/nfs/nfs4filelayout.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
> index e134029..96a75d1 100644
> --- a/fs/nfs/nfs4filelayout.c
> +++ b/fs/nfs/nfs4filelayout.c
> @@ -205,8 +205,8 @@ static int filelayout_async_handle_error(struct rpc_task *task,
>  	case -EPIPE:
>  		dprintk("%s DS connection error %d\n", __func__,
>  			task->tk_status);
> -		if (!filelayout_test_devid_invalid(devid))
> -			_pnfs_return_layout(inode);
> +		clear_bit(NFS_INO_LAYOUTCOMMIT, &NFS_I(inode)->flags);
> +		_pnfs_return_layout(inode);
>  		filelayout_mark_devid_invalid(devid);
>  		rpc_wake_up(&tbl->slot_tbl_waitq);
>  		nfs4_ds_disconnect(clp);




* Re: [PATCH 2/3] NFSv4.1 mark layout when already returned
  2012-06-01 17:19 ` [PATCH 2/3] NFSv4.1 mark layout when already returned andros
@ 2012-06-02 22:51   ` Boaz Harrosh
  2012-06-05 13:36     ` Adamson, Andy
  0 siblings, 1 reply; 26+ messages in thread
From: Boaz Harrosh @ 2012-06-02 22:51 UTC (permalink / raw)
  To: andros; +Cc: trond.myklebust, linux-nfs

On 06/01/2012 08:19 PM, andros@netapp.com wrote:

> From: Andy Adamson <andros@netapp.com>
> 
> When fencing, pnfs_return_layout can be called multiple times with in-flight
> i/o referencing lsegs on its plh_segs list.
> 
> Remember that LAYOUTRETURN has been called, and do not call it again.
> 


NACK

In objects-layout we must report all errors on layout_return. We
accumulate them and report all errors at once. So we need
the return after all in flights are back. (And no new IOs are
sent) Otherwise we might miss some.

Also the RFC mandates that we do not use any layout or have
IOs in flight, once we return the layout.

I was under the impression that only when the last reference
on a layout is dropped only then we send the lo_return.
If it is not so, this is the proper fix.

1. Mark LO invalid so all new IO waits or goes to MDS
3. When LO ref drops to Zero send the lo_return.
4. After LAYOUTRETURN_DONE is back, re-enable layouts.
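
Roughly what I have in mind, as an untested sketch. The NFS_LAYOUT_NEED_RETURN
flag and the helper names below are invented for illustration, they are not
what is in the tree today:

/* 1. Mark the LO invalid so all new IO waits or goes to the MDS */
static void pnfs_mark_layout_for_return(struct pnfs_layout_hdr *lo)
{
	set_bit(NFS_LAYOUT_INVALID, &lo->plh_flags);
	set_bit(NFS_LAYOUT_NEED_RETURN, &lo->plh_flags);	/* invented flag */
}

/* 3. When the LO reference count drops to zero, send the lo_return */
static void pnfs_layout_final_put(struct pnfs_layout_hdr *lo)
{
	if (test_and_clear_bit(NFS_LAYOUT_NEED_RETURN, &lo->plh_flags))
		send_layoutreturn(lo);				/* invented helper */
}

/* 4. When LAYOUTRETURN_DONE is back, re-enable layouts */
static void pnfs_layoutreturn_done(struct pnfs_layout_hdr *lo)
{
	clear_bit(NFS_LAYOUT_INVALID, &lo->plh_flags);
}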

I'm so sorry it is not so today. I should have tested for
this. I admit that all my error injection tests are
single-file single thread so I did not test for this.

Sigh, work never ends. Tell me if I can help with this
Boaz

> Signed-off-by: Andy Adamson <andros@netapp.com>
> ---
>  fs/nfs/pnfs.c |    6 ++++--
>  fs/nfs/pnfs.h |   13 +++++++++++++
>  2 files changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index 854df5e..9ffd5f8 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -664,11 +664,11 @@ _pnfs_return_layout(struct inode *ino)
>  	nfs4_stateid stateid;
>  	int status = 0, empty = 0;
>  
> -	dprintk("--> %s\n", __func__);
> +	dprintk("--> %s for inode %lu\n", __func__, ino->i_ino);
>  
>  	spin_lock(&ino->i_lock);
>  	lo = nfsi->layout;
> -	if (!lo) {
> +	if (!lo || pnfs_test_layout_returned(lo)) {
>  		spin_unlock(&ino->i_lock);
>  		dprintk("%s: no layout to return\n", __func__);
>  		return status;
> @@ -705,6 +705,8 @@ _pnfs_return_layout(struct inode *ino)
>  	lrp->clp = NFS_SERVER(ino)->nfs_client;
>  
>  	status = nfs4_proc_layoutreturn(lrp);
> +	if (status == 0)
> +		pnfs_mark_layout_returned(lo);
>  out:
>  	dprintk("<-- %s status: %d\n", __func__, status);
>  	return status;
> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> index 29fd23c..8be6de2 100644
> --- a/fs/nfs/pnfs.h
> +++ b/fs/nfs/pnfs.h
> @@ -64,6 +64,7 @@ enum {
>  	NFS_LAYOUT_ROC,			/* some lseg had roc bit set */
>  	NFS_LAYOUT_DESTROYED,		/* no new use of layout allowed */
>  	NFS_LAYOUT_INVALID,		/* layout is being destroyed */
> +	NFS_LAYOUT_RETURNED,		/* layout has already been returned */
>  };
>  
>  enum layoutdriver_policy_flags {
> @@ -255,6 +256,18 @@ struct nfs4_deviceid_node *nfs4_insert_deviceid_node(struct nfs4_deviceid_node *
>  bool nfs4_put_deviceid_node(struct nfs4_deviceid_node *);
>  void nfs4_deviceid_purge_client(const struct nfs_client *);
>  
> +static inline void
> +pnfs_mark_layout_returned(struct pnfs_layout_hdr *lo)
> +{
> +	set_bit(NFS_LAYOUT_RETURNED, &lo->plh_flags);
> +}
> +
> +static inline bool
> +pnfs_test_layout_returned(struct pnfs_layout_hdr *lo)
> +{
> +	return test_bit(NFS_LAYOUT_RETURNED, &lo->plh_flags);
> +}
> +
>  static inline int lo_fail_bit(u32 iomode)
>  {
>  	return iomode == IOMODE_RW ?




* Re: [PATCH 1/3] NFSv4.1 do not call LAYOUTRETURN when there are no legs
  2012-06-01 17:19 [PATCH 1/3] NFSv4.1 do not call LAYOUTRETURN when there are no legs andros
  2012-06-01 17:19 ` [PATCH 2/3] NFSv4.1 mark layout when already returned andros
  2012-06-01 17:19 ` [PATCH 3/3] NFSv4.1 fence all layouts with file layout data server connection errors andros
@ 2012-06-02 23:00 ` Boaz Harrosh
  2012-06-05 13:36   ` Adamson, Andy
  2 siblings, 1 reply; 26+ messages in thread
From: Boaz Harrosh @ 2012-06-02 23:00 UTC (permalink / raw)
  To: andros; +Cc: trond.myklebust, linux-nfs

On 06/01/2012 08:19 PM, andros@netapp.com wrote:

> From: Andy Adamson <andros@netapp.com>
> 


I wish you would send your patches preceded with a [PATCHSET 0/x]
That gives us a short summary and a little background on its motivation.

For example here, what are the problems you encounter, how did you test
it and so on...

Do these actually depend and/or fix your previous "send layout_return ..."
In files layout?

(You can still send all patches by git send-email command line just the same
 just add a [PATCHSET 0/x]-xxx-yyy.txt and it will get sent first, right?)

Thanks
Boaz

> mark_matching_lsegs_invalid() does not indicate if the plh_segs list is
> empty on entrance - and it can be empty on exit.
> 
> When the plh_segs list is empty, mark_matching_lsegs_invalid removes
> the pnfs_layout_hdr creation reference.
> 
> Signed-off-by: Andy Adamson <andros@netapp.com>
> ---
>  fs/nfs/pnfs.c |   17 +++++++++++++++--
>  1 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index b8323aa..854df5e 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -646,7 +646,14 @@ out_err_free:
>  	return NULL;
>  }
>  
> -/* Initiates a LAYOUTRETURN(FILE) */
> +/*
> + * Initiates a LAYOUTRETURN(FILE), and removes the pnfs_layout_hdr
> + * when the layout segment list is empty.
> + *
> + * Note that a pnfs_layout_hdr can exist with an empty layout segment
> + * list when LAYOUTGET has failed, or when LAYOUTGET succeeded, but the
> + * deviceid is marked invalid.
> + */
>  int
>  _pnfs_return_layout(struct inode *ino)
>  {
> @@ -655,7 +662,7 @@ _pnfs_return_layout(struct inode *ino)
>  	LIST_HEAD(tmp_list);
>  	struct nfs4_layoutreturn *lrp;
>  	nfs4_stateid stateid;
> -	int status = 0;
> +	int status = 0, empty = 0;
>  
>  	dprintk("--> %s\n", __func__);
>  
> @@ -669,11 +676,17 @@ _pnfs_return_layout(struct inode *ino)
>  	stateid = nfsi->layout->plh_stateid;
>  	/* Reference matched in nfs4_layoutreturn_release */
>  	get_layout_hdr(lo);
> +	empty = list_empty(&lo->plh_segs);
>  	mark_matching_lsegs_invalid(lo, &tmp_list, NULL);
>  	lo->plh_block_lgets++;
>  	spin_unlock(&ino->i_lock);
>  	pnfs_free_lseg_list(&tmp_list);
>  
> +	if (empty) {
> +		put_layout_hdr(lo);
> +		dprintk("%s: no layout segments to return\n", __func__);
> +		return status;
> +	}
>  	WARN_ON(test_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags));
>  
>  	lrp = kzalloc(sizeof(*lrp), GFP_KERNEL);




* Re: [PATCH 3/3] NFSv4.1 fence all layouts with file layout data server connection errors
  2012-06-02 22:33   ` Boaz Harrosh
@ 2012-06-04 13:49     ` Adamson, Andy
  0 siblings, 0 replies; 26+ messages in thread
From: Adamson, Andy @ 2012-06-04 13:49 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: Myklebust, Trond, <linux-nfs@vger.kernel.org>


On Jun 2, 2012, at 6:33 PM, Boaz Harrosh wrote:

> On 06/01/2012 08:19 PM, andros@netapp.com wrote:
> 
>> From: Andy Adamson <andros@netapp.com>
>> 
> 
> 
> The title does not match the message text or the patch at all.
> 
> More like: Don't send LAYOUTCOMMIT if data is rewritten through MDS 

Well, the meat of this patch is the LAYOUTRETURN for fencing.

-->Andy

> 
> I think.
> Boaz
> 
>> Clear the inode LAYOUTCOMMIT flag as we will resend writes through the MDS.
>> 
>> Signed-off-by: Andy Adamson <andros@netapp.com>
>> ---
>> fs/nfs/nfs4filelayout.c |    4 ++--
>> 1 files changed, 2 insertions(+), 2 deletions(-)
>> 
>> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
>> index e134029..96a75d1 100644
>> --- a/fs/nfs/nfs4filelayout.c
>> +++ b/fs/nfs/nfs4filelayout.c
>> @@ -205,8 +205,8 @@ static int filelayout_async_handle_error(struct rpc_task *task,
>> 	case -EPIPE:
>> 		dprintk("%s DS connection error %d\n", __func__,
>> 			task->tk_status);
>> -		if (!filelayout_test_devid_invalid(devid))
>> -			_pnfs_return_layout(inode);
>> +		clear_bit(NFS_INO_LAYOUTCOMMIT, &NFS_I(inode)->flags);
>> +		_pnfs_return_layout(inode);
>> 		filelayout_mark_devid_invalid(devid);
>> 		rpc_wake_up(&tbl->slot_tbl_waitq);
>> 		nfs4_ds_disconnect(clp);
> 
> 



* Re: [PATCH 1/3] NFSv4.1 do not call LAYOUTRETURN when there are no legs
  2012-06-02 23:00 ` [PATCH 1/3] NFSv4.1 do not call LAYOUTRETURN when there are no legs Boaz Harrosh
@ 2012-06-05 13:36   ` Adamson, Andy
  2012-06-05 15:01     ` Boaz Harrosh
  0 siblings, 1 reply; 26+ messages in thread
From: Adamson, Andy @ 2012-06-05 13:36 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: Myklebust, Trond, <linux-nfs@vger.kernel.org>


On Jun 2, 2012, at 7:00 PM, Boaz Harrosh wrote:

> On 06/01/2012 08:19 PM, andros@netapp.com wrote:
> 
>> From: Andy Adamson <andros@netapp.com>
>> 
> 
> 
> I wish you would send your patches preceded with a [PATCHSET 0/x]
> That gives us a short summary and a little background on its motivation.

I often do. These patches actually stand alone so I didn't this time. 

This patch is totally explained by the patch title, patch comments, and in-line comments. Without this patch, we can end up
sending a LAYOUTRETURN with no layout segments that need to be returned! Why? because as noted in the patch comments, mark_matching_lsegs_invalid does not indicate when the plh_segs list is empty…etc. When? well, as noted in the in-line code comments, when the pnfs_layout_hdr has an empty plh_segs, which can occur when a LAYOUTGET fails, or when LAYOUTGET succeeded, but the deviceid is marked invalid.

What else is there to say?!

I guess I can add the --cover-letter to the git format-patch and cut/paste to it.

-->Andy

> 
> For example here, what are the problems you encounter, how did you test
> it and so on...
> 
> Do these actually depend and/or fix your previous "send layout_return ..."
> In files layout?
> 
> (You can still send all patches by git send-email command line just the same
> just add a [PATCHSET 0/x]-xxx-yyy.txt and it will get sent first, right?)
> 
> Thanks
> Boaz
> 
>> mark_matching_lsegs_invalid() does not indicate if the plh_segs list is
>> empty on entrance - and it can be empty on exit.
>> 
>> When the plh_segs list is empty, mark_matching_lsegs_invalid removes
>> the pnfs_layout_hdr creation reference.
>> 
>> Signed-off-by: Andy Adamson <andros@netapp.com>
>> ---
>> fs/nfs/pnfs.c |   17 +++++++++++++++--
>> 1 files changed, 15 insertions(+), 2 deletions(-)
>> 
>> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>> index b8323aa..854df5e 100644
>> --- a/fs/nfs/pnfs.c
>> +++ b/fs/nfs/pnfs.c
>> @@ -646,7 +646,14 @@ out_err_free:
>> 	return NULL;
>> }
>> 
>> -/* Initiates a LAYOUTRETURN(FILE) */
>> +/*
>> + * Initiates a LAYOUTRETURN(FILE), and removes the pnfs_layout_hdr
>> + * when the layout segment list is empty.
>> + *
>> + * Note that a pnfs_layout_hdr can exist with an empty layout segment
>> + * list when LAYOUTGET has failed, or when LAYOUTGET succeeded, but the
>> + * deviceid is marked invalid.
>> + */
>> int
>> _pnfs_return_layout(struct inode *ino)
>> {
>> @@ -655,7 +662,7 @@ _pnfs_return_layout(struct inode *ino)
>> 	LIST_HEAD(tmp_list);
>> 	struct nfs4_layoutreturn *lrp;
>> 	nfs4_stateid stateid;
>> -	int status = 0;
>> +	int status = 0, empty = 0;
>> 
>> 	dprintk("--> %s\n", __func__);
>> 
>> @@ -669,11 +676,17 @@ _pnfs_return_layout(struct inode *ino)
>> 	stateid = nfsi->layout->plh_stateid;
>> 	/* Reference matched in nfs4_layoutreturn_release */
>> 	get_layout_hdr(lo);
>> +	empty = list_empty(&lo->plh_segs);
>> 	mark_matching_lsegs_invalid(lo, &tmp_list, NULL);
>> 	lo->plh_block_lgets++;
>> 	spin_unlock(&ino->i_lock);
>> 	pnfs_free_lseg_list(&tmp_list);
>> 
>> +	if (empty) {
>> +		put_layout_hdr(lo);
>> +		dprintk("%s: no layout segments to return\n", __func__);
>> +		return status;
>> +	}
>> 	WARN_ON(test_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags));
>> 
>> 	lrp = kzalloc(sizeof(*lrp), GFP_KERNEL);
> 
> 



* Re: [PATCH 2/3] NFSv4.1 mark layout when already returned
  2012-06-02 22:51   ` Boaz Harrosh
@ 2012-06-05 13:36     ` Adamson, Andy
  2012-06-05 13:47       ` Andy Adamson
  2012-06-05 14:54       ` Boaz Harrosh
  0 siblings, 2 replies; 26+ messages in thread
From: Adamson, Andy @ 2012-06-05 13:36 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: Myklebust, Trond, <linux-nfs@vger.kernel.org>


On Jun 2, 2012, at 6:51 PM, Boaz Harrosh wrote:

> On 06/01/2012 08:19 PM, andros@netapp.com wrote:
> 
>> From: Andy Adamson <andros@netapp.com>
>> 
>> When fencing, pnfs_return_layout can be called multiple times with in-flight
>> i/o referencing lsegs on its plh_segs list.
>> 
>> Remember that LAYOUTRETURN has been called, and do not call it again.
>> 
> 
> 
> NACK
> 
> In objects-layout we must report all errors on layout_return. We
> accumulate them and report all errors at once. So we need
> the return after all in flights are back. (And no new IOs are
> sent) Otherwise we might miss some.

_pnfs_return_layout removes all layouts, and should therefore only be called once. I can add the 'wait for all in-flight' functionality, and we can switch behaviors (wait or not wait).

> Also the RFC mandates that we do not use any layout or have
> IOs in flight, once we return the layout.


You are referring to this piece of the spec?
Section 18.44.3   

  After this call,
   the client MUST NOT use the returned layout(s) and the associated
   storage protocol to access the file data.


The above says that  the client MUST NOT send any _new_ i/o using the layout.  I don't see any reference to in-flight i/o, nor should there be in the error case. I get a connection error. Did the i/o's I sent get to the data server?  The reason to send a LAYOUTRETURN without waiting for all the in-flights to return with a connection error is precisely to fence any in-flight i/o because I'm resending through the MDS.


> 
> I was under the impression that only when the last reference
> on a layout is dropped only then we send the lo_return.

> If it is not so, this is the proper fix.

> 
> 1. Mark LO invalid so all new IO waits or goes to MDS
> 3. When LO ref drops to Zero send the lo_return.

Yes for the normal case (evict inode for the file layout), but not if I'm in an error situation and I want to fence the DS from in-flight i/o.

> 4. After LAYOUTRETURN_DONE is back, re-enable layouts.
> I'm so sorry it is not so today. I should have tested for
> this. I admit that all my error injection tests are
> single-file single thread so I did not test for this.
> 
> Sigh, work never ends. Tell me if I can help with this

I'll add the wait/no-wait…

-->Andy

> Boaz
> 
>> Signed-off-by: Andy Adamson <andros@netapp.com>
>> ---
>> fs/nfs/pnfs.c |    6 ++++--
>> fs/nfs/pnfs.h |   13 +++++++++++++
>> 2 files changed, 17 insertions(+), 2 deletions(-)
>> 
>> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>> index 854df5e..9ffd5f8 100644
>> --- a/fs/nfs/pnfs.c
>> +++ b/fs/nfs/pnfs.c
>> @@ -664,11 +664,11 @@ _pnfs_return_layout(struct inode *ino)
>> 	nfs4_stateid stateid;
>> 	int status = 0, empty = 0;
>> 
>> -	dprintk("--> %s\n", __func__);
>> +	dprintk("--> %s for inode %lu\n", __func__, ino->i_ino);
>> 
>> 	spin_lock(&ino->i_lock);
>> 	lo = nfsi->layout;
>> -	if (!lo) {
>> +	if (!lo || pnfs_test_layout_returned(lo)) {
>> 		spin_unlock(&ino->i_lock);
>> 		dprintk("%s: no layout to return\n", __func__);
>> 		return status;
>> @@ -705,6 +705,8 @@ _pnfs_return_layout(struct inode *ino)
>> 	lrp->clp = NFS_SERVER(ino)->nfs_client;
>> 
>> 	status = nfs4_proc_layoutreturn(lrp);
>> +	if (status == 0)
>> +		pnfs_mark_layout_returned(lo);
>> out:
>> 	dprintk("<-- %s status: %d\n", __func__, status);
>> 	return status;
>> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
>> index 29fd23c..8be6de2 100644
>> --- a/fs/nfs/pnfs.h
>> +++ b/fs/nfs/pnfs.h
>> @@ -64,6 +64,7 @@ enum {
>> 	NFS_LAYOUT_ROC,			/* some lseg had roc bit set */
>> 	NFS_LAYOUT_DESTROYED,		/* no new use of layout allowed */
>> 	NFS_LAYOUT_INVALID,		/* layout is being destroyed */
>> +	NFS_LAYOUT_RETURNED,		/* layout has already been returned */
>> };
>> 
>> enum layoutdriver_policy_flags {
>> @@ -255,6 +256,18 @@ struct nfs4_deviceid_node *nfs4_insert_deviceid_node(struct nfs4_deviceid_node *
>> bool nfs4_put_deviceid_node(struct nfs4_deviceid_node *);
>> void nfs4_deviceid_purge_client(const struct nfs_client *);
>> 
>> +static inline void
>> +pnfs_mark_layout_returned(struct pnfs_layout_hdr *lo)
>> +{
>> +	set_bit(NFS_LAYOUT_RETURNED, &lo->plh_flags);
>> +}
>> +
>> +static inline bool
>> +pnfs_test_layout_returned(struct pnfs_layout_hdr *lo)
>> +{
>> +	return test_bit(NFS_LAYOUT_RETURNED, &lo->plh_flags);
>> +}
>> +
>> static inline int lo_fail_bit(u32 iomode)
>> {
>> 	return iomode == IOMODE_RW ?
> 
> 



* Re: [PATCH 2/3] NFSv4.1 mark layout when already returned
  2012-06-05 13:36     ` Adamson, Andy
@ 2012-06-05 13:47       ` Andy Adamson
  2012-06-05 14:54       ` Boaz Harrosh
  1 sibling, 0 replies; 26+ messages in thread
From: Andy Adamson @ 2012-06-05 13:47 UTC (permalink / raw)
  To: Adamson, Andy
  Cc: Boaz Harrosh, Myklebust, Trond, <linux-nfs@vger.kernel.org>

On Tue, Jun 5, 2012 at 9:36 AM, Adamson, Andy
<William.Adamson@netapp.com> wrote:
>
> On Jun 2, 2012, at 6:51 PM, Boaz Harrosh wrote:
>
>> On 06/01/2012 08:19 PM, andros@netapp.com wrote:
>>
>>> From: Andy Adamson <andros@netapp.com>
>>>
>>> When fencing, pnfs_return_layout can be called multiple times with in-flight
>>> i/o referencing lsegs on its plh_segs list.
>>>
>>> Remember that LAYOUTRETURN has been called, and do not call it again.
>>>
>>
>>
>> NACK
>>
>> In objects-layout we must report all errors on layout_return. We
>> accumulate them and report all errors at once. So we need
>> the return after all in flights are back. (And no new IOs are
>> sent) Otherwise we might miss some.
>
> _pnfs_return_layout removes all layouts, and should therefore only be called once.

To clarify: There is more work to be done here - the flag should be
cleared once the plh_segs list is empty....
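
Something along these lines, as an untested sketch (the helper below is
hypothetical, just to illustrate - the real fix may look different):

/* once the plh_segs list has emptied out, drop the NFS_LAYOUT_RETURNED
 * flag so a later layout can be returned again */
static inline void
pnfs_clear_layout_returned(struct pnfs_layout_hdr *lo)
{
	if (list_empty(&lo->plh_segs))
		clear_bit(NFS_LAYOUT_RETURNED, &lo->plh_flags);
}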

I'll resend, and also fix the "wait for all in-flight" normal behavior.

Thanks for the comments

-->Andy

> I can add the 'wait for all in-flight' functionality, and we can switch behaviors (wait or not wait).
>
>> Also the RFC mandates that we do not use any layout or have
>> IOs in flight, once we return the layout.
>
>
> You are referring to this piece of the spec?
> Section 18.44.3
>
>  After this call,
>   the client MUST NOT use the returned layout(s) and the associated
>   storage protocol to access the file data.
>
>
> The above says that  the client MUST NOT send any _new_ i/o using the layout.  I don't see any reference to in-flight i/o, nor should there be in the error case. I get a connection error. Did the i/o's I sent get to the data server?  The reason to send a LAYOUTRETURN without waiting for all the in-flights to return with a connection error is precisely to fence any in-flight i/o because I'm resending through the MDS.
>
>
>>
>> I was under the impression that only when the last reference
>> on a layout is dropped only then we send the lo_return.
>
>> If it is not so, this is the proper fix.
>
>>
>> 1. Mark LO invalid so all new IO waits or goes to MDS
>> 3. When LO ref drops to Zero send the lo_return.
>
> Yes for the normal case (evict inode for the file layout), but not if I'm in an error situation and I want to fence the DS from in-flight i/o.
>
>> 4. After LAYOUTRETURN_DONE is back, re-enable layouts.
>> I'm so sorry it is not so today. I should have tested for
>> this. I admit that all my error injection tests are
>> single-file single thread so I did not test for this.
>>
>> Sigh, work never ends. Tell me if I can help with this
>
> I'll add the wait/no-wait…
>
> -->Andy
>
>> Boaz
>>
>>> Signed-off-by: Andy Adamson <andros@netapp.com>
>>> ---
>>> fs/nfs/pnfs.c |    6 ++++--
>>> fs/nfs/pnfs.h |   13 +++++++++++++
>>> 2 files changed, 17 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>>> index 854df5e..9ffd5f8 100644
>>> --- a/fs/nfs/pnfs.c
>>> +++ b/fs/nfs/pnfs.c
>>> @@ -664,11 +664,11 @@ _pnfs_return_layout(struct inode *ino)
>>>      nfs4_stateid stateid;
>>>      int status = 0, empty = 0;
>>>
>>> -    dprintk("--> %s\n", __func__);
>>> +    dprintk("--> %s for inode %lu\n", __func__, ino->i_ino);
>>>
>>>      spin_lock(&ino->i_lock);
>>>      lo = nfsi->layout;
>>> -    if (!lo) {
>>> +    if (!lo || pnfs_test_layout_returned(lo)) {
>>>              spin_unlock(&ino->i_lock);
>>>              dprintk("%s: no layout to return\n", __func__);
>>>              return status;
>>> @@ -705,6 +705,8 @@ _pnfs_return_layout(struct inode *ino)
>>>      lrp->clp = NFS_SERVER(ino)->nfs_client;
>>>
>>>      status = nfs4_proc_layoutreturn(lrp);
>>> +    if (status == 0)
>>> +            pnfs_mark_layout_returned(lo);
>>> out:
>>>      dprintk("<-- %s status: %d\n", __func__, status);
>>>      return status;
>>> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
>>> index 29fd23c..8be6de2 100644
>>> --- a/fs/nfs/pnfs.h
>>> +++ b/fs/nfs/pnfs.h
>>> @@ -64,6 +64,7 @@ enum {
>>>      NFS_LAYOUT_ROC,                 /* some lseg had roc bit set */
>>>      NFS_LAYOUT_DESTROYED,           /* no new use of layout allowed */
>>>      NFS_LAYOUT_INVALID,             /* layout is being destroyed */
>>> +    NFS_LAYOUT_RETURNED,            /* layout has already been returned */
>>> };
>>>
>>> enum layoutdriver_policy_flags {
>>> @@ -255,6 +256,18 @@ struct nfs4_deviceid_node *nfs4_insert_deviceid_node(struct nfs4_deviceid_node *
>>> bool nfs4_put_deviceid_node(struct nfs4_deviceid_node *);
>>> void nfs4_deviceid_purge_client(const struct nfs_client *);
>>>
>>> +static inline void
>>> +pnfs_mark_layout_returned(struct pnfs_layout_hdr *lo)
>>> +{
>>> +    set_bit(NFS_LAYOUT_RETURNED, &lo->plh_flags);
>>> +}
>>> +
>>> +static inline bool
>>> +pnfs_test_layout_returned(struct pnfs_layout_hdr *lo)
>>> +{
>>> +    return test_bit(NFS_LAYOUT_RETURNED, &lo->plh_flags);
>>> +}
>>> +
>>> static inline int lo_fail_bit(u32 iomode)
>>> {
>>>      return iomode == IOMODE_RW ?
>>
>>
>


* Re: [PATCH 2/3] NFSv4.1 mark layout when already returned
  2012-06-05 13:36     ` Adamson, Andy
  2012-06-05 13:47       ` Andy Adamson
@ 2012-06-05 14:54       ` Boaz Harrosh
  2012-06-05 19:22         ` Andy Adamson
  1 sibling, 1 reply; 26+ messages in thread
From: Boaz Harrosh @ 2012-06-05 14:54 UTC (permalink / raw)
  To: Adamson, Andy; +Cc: Myklebust, Trond, <linux-nfs@vger.kernel.org>

On 06/05/2012 04:36 PM, Adamson, Andy wrote:

> On Jun 2, 2012, at 6:51 PM, Boaz Harrosh wrote:
>>
>> In objects-layout we must report all errors on layout_return. We
>> accumulate them and report all errors at once. So we need
>> the return after all in flights are back. (And no new IOs are
>> sent) Otherwise we might miss some.
> 

>> _pnfs_return_layout removes all layouts, and should therefore only be
> called once. 


I agree current behavior is a bug, hence my apology.

>> I can add the 'wait for all in-flight' functionality,
> and we can switch behaviors (wait or not wait).
> 


I disagree. You must "delay" the send, see below.

>> Also the RFC mandates that we do not use any layout or have
>> IOs in flight, once we return the layout.
> 
> 
> You are referring to this piece of the spec?
> Section 18.44.3   
> 
>   After this call,
>    the client MUST NOT use the returned layout(s) and the associated
>    storage protocol to access the file data.
> 
> 

> The above says that the client MUST NOT send any _new_ i/o using the
> layout. I don't see any reference to in-flight i/o, 


I don't see reference to in-flight i/o either, so what does that mean?
Yes or No? It does not say; it's vague. For me in-flight means "USING"

Because in-flight means half was written/read and half was not, if the
lo_return was received in the middle then the half that came after was
using the layout information after the Server received an lo_return,
which clearly violates the above.

In any way, at this point in the code you do not have the information of
If the RPCs are in the middle of the transfer, hence defined as in-flight.
Or they are just inside your internal client queues and will be sent clearly
after the lo_return which surely violates the above. (Without the need of
explicit definition of in flight.)

> nor should there
> be in the error case. I get a connection error. Did the i/o's I sent
> get to the data server? 


We should always assume that statistically half was sent and half was not
In any way we will send all the "uncertain range" again.

> The reason to send a LAYOUTRETURN without
> waiting for all the in-flights to return with a connection error is
> precisely to fence any in-flight i/o because I'm resending through
> the MDS.
> 


This is clearly in contradiction of the RFC. And is imposing a Server
behavior that was not specified in RFC.

All started IO to the specified DS will return with "connection error"
pretty fast, right? because the first disconnect probably took a timeout, but
once the socket identified a disconnect it will stay in that state
until a reconnect, right.

So what are you attempting to do - make your internal client Q drain very fast
since you are going through MDS? But you are doing that by assuming the
Server will fence ALL IO, and not by simply aborting your own Q. Highly unorthodox 
and certainly in violation of above.

> 
>>
>> I was under the impression that only when the last reference
>> on a layout is dropped only then we send the lo_return.
> 
>> If it is not so, this is the proper fix.
> 
>>
>> 1. Mark LO invalid so all new IO waits or goes to MDS
>> 3. When LO ref drops to Zero send the lo_return.
> 
> Yes for the normal case (evict inode for the file layout),

> but not if I'm in an error situation and I want to fence the DS from in-flight i/o.

A client has no way stated in the protocol to cause a "fence". This is just your
wishful thinking. lo_return is something else, lo_return means I'm no longer using
the file, not "please protect me from myself, because I will send more IO after that,
but please ignore it"

All you can do is abort your own client Q, all these RPCs that did not get sent
or errored-out will be resent through MDS, there might be half an RPC of overlap.

Think of 4.2 when you will need to report these errors to MDS - on lo_return -
Your code will not work.

Lets backtrack a second. Let me see if I understand what is your trick:
1. Say we have a file with a files-layout of which device_id specifies 3 DSs.
2. And say I have a large IO generated by an app to that file.
3. In the middle of the IO, one of the DSs, say DS-B, returns with a "connection error"
4. The RPCs stripe_units generated to DS-B will all quickly return with "connection error"
  after the first error.
5. But the RPCs to DS-A and DS-C will continue to IO. Until done.

But since you will send the entire range of IO through MDS you want that DS-A, DS-C to
reject any further IO, and any RPCs in client Qs for DS-A and DS-C will return quickly
with "io rejected".

Is that your idea?

> 
>> 4. After LAYOUTRETURN_DONE is back, re-enable layouts.
>> I'm so sorry it is not so today. I should have tested for
>> this. I admit that all my error injection tests are
>> single-file single thread so I did not test for this.
>>
>> Sigh, work never ends. Tell me if I can help with this
> 
> I'll add the wait/no-wait…
> 


I would not want a sync wait at all. Please don't code any sync wait for me. It will not
work for objects, and will produce dead-locks.

What I need is that a flag is raised on the lo_seg, and when last ref on the
lo_seg drops an lo_return is sent. So either the call to layout_return causes
the lo_return, or the last io_done of the inflights will cause the send.
(You see io_done is the one that calls layout_return in the first place)
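
In rough pseudo-code - the flag and the helpers here are made up, just to
show the shape I mean:

/* raise a flag on the lseg instead of waiting, and let the last
 * reference drop fire the LAYOUTRETURN */
static void mark_lseg_for_return(struct pnfs_layout_segment *lseg)
{
	set_bit(NFS_LSEG_NEED_RETURN, &lseg->pls_flags);	/* invented flag */
}

static void put_lseg_sketch(struct pnfs_layout_segment *lseg)
{
	if (atomic_dec_and_test(&lseg->pls_refcount) &&
	    test_bit(NFS_LSEG_NEED_RETURN, &lseg->pls_flags))
		send_layoutreturn_for_lseg(lseg);		/* invented helper */
}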

!! BUT Please do not do any waits at all for my sake, because this is a sure dead-lock !!

And if you ask me It's the way you want it too, Because that's the RFC, and that's
what you'll need for 4.2.

And perhaps it should not be that hard to implement a Q abort, and not need that
unorthodox fencing which is not specified in the RFC. And it might be also possible to only
send IO of DS-B through MDS but keep the inflight IO to DS-A and DS-C valid and not resend.

> -->Andy
> 
>> Boaz


Thanks
Boaz



* Re: [PATCH 1/3] NFSv4.1 do not call LAYOUTRETURN when there are no legs
  2012-06-05 13:36   ` Adamson, Andy
@ 2012-06-05 15:01     ` Boaz Harrosh
  0 siblings, 0 replies; 26+ messages in thread
From: Boaz Harrosh @ 2012-06-05 15:01 UTC (permalink / raw)
  To: Adamson, Andy; +Cc: Myklebust, Trond, <linux-nfs@vger.kernel.org>

On 06/05/2012 04:36 PM, Adamson, Andy wrote:

> 
> On Jun 2, 2012, at 7:00 PM, Boaz Harrosh wrote:
> 
>> On 06/01/2012 08:19 PM, andros@netapp.com wrote:
>>
>>> From: Andy Adamson <andros@netapp.com>
>>>
>>
>>
>> I wish you would send your patches preceded with a [PATCHSET 0/x]
>> That gives us a short summery and a little background on its motivation.
> 

> I often do. These patches actually stand alone so I didn't this
> time.
> 
> This patch is totally explained by the patch title, patch comments,
> and in-line comments. Without this patch, we can end up sending a
> LAYOUTRETURN with no layout segments that need to be returned! Why?
> because as noted in the patch comments, mark_matching_lsegs_invalid
> does not indicate when the plh_segs list is empty…etc. When? well, as
> noted in the in-line code comments, when the pnfs_layout_hdr has an
> empty plh_segs which can occur when a LAYOUTGET fails, or when
> LAYOUTGET succeeded, but the deviceid is marked invalid
> 


Yes I noticed that the comment was very good. The in-code comment does
not replace the commit title/message. I would prefer it repeated also
in the patch text and/or in the cover-letter. Just to lower the threshold
to entry.

> What else is there to say?!
> 
> I guess I can add the --cover-letter to the git format-patch and
> cut/paste to it.
> 


I'm not sure what you meant. But yes exactly ;-)

> -->Andy
> 
>>


Thanks
Boaz


* Re: [PATCH 2/3] NFSv4.1 mark layout when already returned
  2012-06-05 14:54       ` Boaz Harrosh
@ 2012-06-05 19:22         ` Andy Adamson
  2012-06-05 20:49           ` Boaz Harrosh
  2012-06-11  9:56           ` Benny Halevy
  0 siblings, 2 replies; 26+ messages in thread
From: Andy Adamson @ 2012-06-05 19:22 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Adamson, Andy, Myklebust, Trond, <linux-nfs@vger.kernel.org>

On Tue, Jun 5, 2012 at 10:54 AM, Boaz Harrosh <bharrosh@panasas.com> wrote:
> On 06/05/2012 04:36 PM, Adamson, Andy wrote:
>
>> On Jun 2, 2012, at 6:51 PM, Boaz Harrosh wrote:
>>>
>>> In objects-layout we must report all errors on layout_return. We
>>> accumulate them and report all errors at once. So we need
>>> the return after all in flights are back. (And no new IOs are
>>> sent) Otherwise we might miss some.
>>
>
>> _pnfs_return_layout removes all layouts, and should therefore only be
>> called once.
>
>
> I agree current behavior is a bug, hence my apology.
>
>>> I can add the 'wait for all in-flight' functionality,
>> and we can switch behaviors (wait or not wait).
>>
>
>
> I disagree. You must "delay" the send, see below.
>
>>> Also the RFC mandates that we do not use any layout or have
>>> IOs in flight, once we return the layout.
>>
>>
>> You are referring to this piece of the spec?
>> Section 18.44.3
>>
>>   After this call,
>>    the client MUST NOT use the returned layout(s) and the associated
>>    storage protocol to access the file data.
>>
>>
>
>> The above says that the client MUST NOT send any _new_ i/o using the
>> layout. I don't see any reference to in-flight i/o,
>
>
> I don't see reference to in-flight i/o either, so what does that mean?
> Yes or No? It does not say; it's vague. For me in-flight means "USING"

I shot an arrow in the air - where it lands I know not where. Am I
still using the arrow after I shoot?  If so, exactly when do I know
that it is not in use by me?

If my RPC's time out,
is the DS using the layout? Suppose it's not a network partition, or a
DS reboot, but just a DS under really heavy load (or for some other
reason) and does not reply within the timeout? Is the client still
using the layout?

So I wait for the "answer", get timeouts, and I have no more
information than if I didn't wait for the answer.  Regardless, once
the decision is made on the client to not send any more i/o using that
data server, I should let the MDS know.


>
> Because in-flight means half was written/read and half was not, if the
> lo_return was received in the middle then the half that came after was
> using the layout information after the Server received an lo_return,
> which clearly violates the above.

No it doesn't. That just means the DS is using the layout. The client
is done using the layout until it sends new i/o using the layout.

>
> In any way, at this point in the code you do not have the information of
> If the RPCs are in the middle of the transfer, hence defined as in-flight.
> Or they are just inside your internal client queues and will be sent clearly
>> after the lo_return which surely violates the above. (Without the need of
> explicit definition of in flight.)

We are past the transmit state in the RPC FSM for the errors that
trigger the LAYOUTRETURN.

>
>> nor should there
>> be in the error case. I get a connection error. Did the i/o's I sent
>> get to the data server?

If they get to the data server, does the data server use them?! We can
never know. That is exactly why the client is no longer "using" the
layout.

>
>
> We should always assume that statistically half was sent and half was not
> In any way we will send all the "uncertain range" again.

Sure. We should also let the MDS know that we are resending the
"uncertain range"  ASAP. Thus the LAYOUTRETURN.

>
>> The reason to send a LAYOUTRETURN without
>> waiting for all the in-flights to return with a connection error is
>> precisely to fence any in-flight i/o because I'm resending through
>> the MDS.
>>
>
>
> This is clearly in contradiction of the RFC.

I disagree.

> And is imposing a Server
> behavior that was not specified in RFC.

No server behavior is imposed. Just an opportunity for the server to
do the MUST stated below.

Section 13.6

   As described in Section 12.5.1, a client
   MUST NOT send an I/O to a data server for which it does not hold a
   valid layout; the data server MUST reject such an I/O.


>
> All started IO to the specified DS will return with "connection error"
> pretty fast, right?

Depends on the timeout.

> because the first disconnect probably took a timeout, but
> once the socket identified a disconnect it will stay in that state
> until a reconnect, right.
>
> So what are you attempting to do - make your internal client Q drain very fast
> since you are going through MDS?

If by the internal client Q you mean the DS session slot_tbl_waitq,
that is a separate issue. Those RPC's are redirected internally upon
waking from the Q, they never get sent to the DS.

We do indeed wait for each in-flight RPC to error out before
re-sending the data of the failed RPC to the MDS.

Your theory that the LAYOUTRETURN we call will somehow speed up our
recovery is wrong.


> But you are doing that by assuming the
> Server will fence ALL IO,

What? No!

> and not by simply aborting your own Q.

See above. Of course we abort/redirect our Q.

We choose not to lose data. We do abort any RPC/NFS/Session queues and
re-direct. Only the in-flight RPC's which we have no idea of their
success are resent _after_ getting the error.  The LAYOUTRETURN is an
indication to the MDS that all is not well.

> Highly unorthodox

I'm open to suggestions. :) As I pointed out above, the only reason to
send the LAYOUTRETURN is to let the MDS know that some I/O might be
resent. Once the server gets the returned layout, it MUST reject any
I/O using that layout. (section 13.6).

> and certainly in violation of above.

I disagree.

-->Andy

>
>>
>>>
>>> I was under the impression that only when the last reference
>>> on a layout is dropped only then we send the lo_return.
>>
>>> If it is not so, this is the proper fix.
>>
>>>
>>> 1. Mark LO invalid so all new IO waits or goes to MDS
>>> 3. When LO ref drops to Zero send the lo_return.
>>
>> Yes for the normal case (evict inode for the file layout),
>
>> but not if I'm in an error situation and I want to fence the DS from in-flight i/o.
>
> A client has no way stated in the protocol to cause a "fence". This is just your
> wishful thinking. lo_return is something else, lo_return means I'm no longer using
> the file, not "please protect me from myself, because I will send more IO after that,
> but please ignore it"
>
> All you can do is abort your own client Q, all these RPCs that did not get sent
> or errored-out will be resent through MDS, there might be half an RPC of overlap.
>
> Think of 4.2 when you will need to report these errors to MDS - on lo_return -
> Your code will not work.
>
> Lets backtrack a second. Let me see if I understand what is your trick:
> 1. Say we have a file with a files-layout of which device_id specifies 3 DSs.
> 2. And say I have a large IO generated by an app to that file.
> 3. In the middle of the IO, one of the DSs, say DS-B, returns with a "connection error"
> 4. The RPCs stripe_units generated to DS-B will all quickly return with "connection error"
>  after the first error.
> 5. But the RPCs to DS-A and DS-C will continue to IO. Until done.
>
> But since you will send the entire range of IO through MDS you want that DS-A, DS-C to
> reject any further IO, and any RPCs in client Qs for DS-A and DS-C will return quickly
> with "io rejected".
>
> Is that your idea?
>
>>
>>> 4. After LAYOUTRETURN_DONE is back, re-enable layouts.
>>> I'm so sorry it is not so today. I should have tested for
>>> this. I admit that all my error injection tests are
>>> single-file single thread so I did not test for this.
>>>
>>> Sigh, work never ends. Tell me if I can help with this
>>
>> I'll add the wait/no-wait…
>>
>
>
> I would not want a sync wait at all. Please don't code any sync wait for me. It will not
> work for objects, and will produce dead-locks.
>
> What I need is that a flag is raised on the lo_seg, and when last ref on the
> lo_seg drops an lo_return is sent. So either the call to layout_return causes
> the lo_return, or the last io_done of the inflights will cause the send.
> (You see io_done is the one that calls layout_return in the first place)
>
> !! BUT Please do not do any waits at all for my sake, because this is a sure dead-lock !!
>
> And if you ask me It's the way you want it too, Because that's the RFC, and that's
> what you'll need for 4.2.
>
> And perhaps it should not be that hard to implement a Q abort, and not need that
> unorthodox fencing which is not specified in the RFC. And it might be also possible to only
> send IO of DS-B through MDS but keep the inflight IO to DS-A and DS-C valid and not resend.
>
>> -->Andy
>>
>>> Boaz
>
>
> Thanks
> Boaz
>


* Re: [PATCH 2/3] NFSv4.1 mark layout when already returned
  2012-06-05 19:22         ` Andy Adamson
@ 2012-06-05 20:49           ` Boaz Harrosh
  2012-06-11  9:56           ` Benny Halevy
  1 sibling, 0 replies; 26+ messages in thread
From: Boaz Harrosh @ 2012-06-05 20:49 UTC (permalink / raw)
  To: Andy Adamson
  Cc: Adamson, Andy, Myklebust, Trond, <linux-nfs@vger.kernel.org>

On 06/05/2012 10:22 PM, Andy Adamson wrote:

> On Tue, Jun 5, 2012 at 10:54 AM, Boaz Harrosh <bharrosh@panasas.com> wrote:

> 

I do not understand why the communication is so hard between us. Since I'm
the foreigner speaking, I'll take it on me. So I'll try to explain better.


> We are past the transmit state in the RPC FSM for the errors that
> trigger the LAYOUTRETURN.
> 


!! I'm not talking about the RPC that just produced the time-out and is
calling layout_return(); this one, I agree fully, is done with and is out of our
hands. (We will not send a single byte on it.)

I'm talking about the other requests that are still holding the reference
count on the layout. In what stage are they? Can you guarantee at layout_return
that they will not send any more bytes after the above event?

If like you say below they are aborted then the reference will drop soon
enough. Right?

>>

<>

> 
> If they get to the data server, does the data server use them?! We can
> never know. That is exactly why the client is no longer "using" the
> layout.
> 


Are you sure?? Again, we are in a situation where one RPC has returned. But I see
that the layout has other requests using it. Hence the reference count is not zero.

Are you sure that the client is guaranteed to not send a single byte after
this event? Even RPCs of the same layout but to different DSs that are
fine? If you are sure then please show me, because this is not how I read
this code.

The way I read this code is that it is all highly concurrent. You can't even
guarantee that you are not in the middle of a pagelist_write/read() surely before
any actual RPC send. What will cause that to stop? The only guarantee I see
is the reference count on the layout. It's the only barrier you have that
guarantees you are not sending any more bytes using this layout.

<>

> 
> If by the internal client Q you mean the DS session slot_tbl_waitq,
> that is a separate issue. Those RPC's are redirected internally upon
> waking from the Q, they never get sent to the DS.
> 
> We do indeed wait for each in-flight RPC to error out before
> re-sending the data of the failed RPC to the MDS.
> 
> Your theory that the LAYOUTRETURN we call will somehow speed up our
> recovery is wrong.
> 
> 
>> But you are doing that by assuming the
>> Server will fence ALL IO,
> 
> What? No!
> 
>> and not by simply aborting your own Q.
> 
> See above. Of course we abort/redirect our Q.
> 
> We choose not to lose data. We do abort any RPC/NFS/Session queues and
> re-direct. Only the in-flight RPC's which we have no idea of their
> success are resent _after_ getting the error.  The LAYOUTRETURN is an
> indication to the MDS that all is not well.
> 
>> Highly unorthodox
> 
> I'm open to suggestions. :) As I pointed out above, the only reason to
> send the LAYOUTRETURN is to let the MDS know that some I/O might be
> resent. Once the server gets the returned layout, it MUST reject any
> I/O using that layout. (section 13.6).
> 
>> and certainly in violation of above.
> 
> I disagree.
> 
> -->Andy
> 


You signed off here, so surely you are not going to answer my most
important question, which was:

* I need to not sync-wait in layout return. I need a FLAG marked on
  the layout_segment which will cause its layout_return on last reference.

And so do you. Because before the last reference you are not guaranteed
that some other thread in the client is not busy sending bytes and/or
preparing new RPCs to be sent to other DSs using the same layout.

Again I do not understand your motivation. Please if you answer any of my
comments answer this one first:

There are a bunch of IOs sent to multiple DSs and one RPC times out.

1. Some RPCs have been fully sent and are waiting for a reply.
   I agree these are the arrows out of the bow and out of your hands.
2. Some RPCs are in the middle of being sent; you started sending the
   header but not all the bytes. (Are there more than one per DS in this
   state?)
3. Some RPCs are in internal client Queues and did not start transmission.
4. Some RPCs are just being prepared by other threads; they have taken the
   reference count on the layout_segment and will send new RPCs soon.

Actually the above "one RPC timed out" is in the [1] group, right?

What you are saying is that we are only guaranteed to have state [1] RPCs.
That [2], [3] and [4] are out of the picture and have been aborted and/or
taken care of, and/or serialized by some locks. Well, I find this hard to
believe. Certainly in objects layout I don't see any such guarantee.

And actually, if you are right, then why don't you do what I suggest,
since it will be very soon after the current error-rpc that all the
rest will be aborted and the reference will drop, right?

The way you describe it, only the RPCs in state [1] might take time
to return because they are out of your hands and might take a long
time to time out.

So is it that you don't want to wait for these in state [1]?
I just want to understand.

And at last I want to come back to my concern.

* You want the LAYOUTRETURN to be sent as the *first* RPC that errored,
  since you somehow magically guarantee that the client will not send a
  single byte after that. (And why, I do not yet understand.)

* But for objects-layout it needs the LAYOUTRETURN sent as part of the
  *last* IO in the batch of IOs that was sent as part of the layout.
   This is because it has no magic guarantees that bytes will not be sent
   at the error exit of some middle IO.

   And mainly because it must send a LAYOUTRETURN with all the errors it
   received. If the LAYOUTRETURN was sent with the first one, it might miss
   all the other errors of the other IO requests. Actually it will be a
   memory leak.

So when you write the code could you please look into these things.

And one last thing:

You seem to be doing a full-file LAYOUTRETURN as part of the layout_hdr.
But objects and blocks (and also files, I think) need a LAYOUTRETURN per
lo_segment. The handling (and ref-counting) should be completely
lo_segment based. In fact the Server and protocol know nothing about
layout_hdr. What the RFC calls a LAYOUT is what the client names a lo_segment.

Thanks
Boaz


* Re: [PATCH 2/3] NFSv4.1 mark layout when already returned
  2012-06-05 19:22         ` Andy Adamson
  2012-06-05 20:49           ` Boaz Harrosh
@ 2012-06-11  9:56           ` Benny Halevy
  2012-06-11 10:44             ` Boaz Harrosh
  2012-06-11 15:08             ` Adamson, Andy
  1 sibling, 2 replies; 26+ messages in thread
From: Benny Halevy @ 2012-06-11  9:56 UTC (permalink / raw)
  To: Adamson, Andy
  Cc: Andy Adamson, Boaz Harrosh, Myklebust, Trond,
	<linux-nfs@vger.kernel.org>

On 2012-06-05 22:22, Andy Adamson wrote:
> On Tue, Jun 5, 2012 at 10:54 AM, Boaz Harrosh <bharrosh@panasas.com> wrote:
>> On 06/05/2012 04:36 PM, Adamson, Andy wrote:
>>
>>> On Jun 2, 2012, at 6:51 PM, Boaz Harrosh wrote:
>>>>
>>>> In objects-layout we must report all errors on layout_return. We
>>>> accumulate them and report all errors at once. So we need
>>>> the return after all in flights are back. (And no new IOs are
>>>> sent) Otherwise we might miss some.
>>>
>>
>>> _pnfs_return_layout removes all layouts, and should therefore only be
>>> called once.
>>
>>
>> I agree current behavior is a bug, hence my apology.
>>
>>> I can add the 'wait for all in-flight' functionality,
>>> and we can switch behaviors (wait or not wait).
>>>
>>
>>
>> I disagree. You must "delay" the send, see below.
>>
>>>> Also the RFC mandates that we do not use any layout or have
>>>> IOs in flight, once we return the layout.
>>>
>>>
>>> You are referring to this piece of the spec?
>>> Section 18.44.3
>>>
>>>   After this call,
>>>    the client MUST NOT use the returned layout(s) and the associated
>>>    storage protocol to access the file data.
>>>
>>>
>>
>>> The above says that the client MUST NOT send any _new_ i/o using the
>>> layout. I don't see any reference to in-flight i/o,

Our assumption when designing the objects layout, in particular with regard to
client-based RAID, was that LAYOUTRETURN quiesces in-flight I/Os so that other
clients or the MDS see a consistent parity-stripe state.

>>
>>
>> I don't see reference to in-flight i/o either, so what does that mean?
>> Yes or No? It does not it's vague. For me in-flight means "USING"
> 
> I shot an arrow in the air - where it lands I know not where. Am I
> still using the arrow after I shoot?  If so, exaclty when do I know
> that it is not in use by me?

You need to wait for in-flight I/Os to either succeed, fail (e.g. time out),
or be aborted.  The non-successful cases are going to be reported by the
objects layout driver so the MDS can recover from these errors.

> 
> If my RPC's time out,
> is the DS using the layout? Suppose it's not a network partition, or a
> DS reboot, but just a DS under really heavy load (or for some other
> reason) and does not reply within the timeout? Is the client still
> using the layout?
> 
> So I wait for the "answer", get timeouts, and I have no more
> information than if I didn't wait for the answer.  Regardless, once
> the decision is made on the client to not send any more i/o using that
> data server, I should let the MDS know.
> 
> 

Unfortunately that is too weak for client-based RAID.

>>
>> Because in-flight means half was written/read and half was not, if the
>> lo_return was received in the middle then the half that came after was
>> using the layout information after the Server received an lo_return,
>> which clearly violates the above.
> 
> No it doesn't. That just means the DS is using the layout. The client
> is done using the layout until it sends new i/o using the layout.
> 
>>
>> In any way, at this point in the code you do not have the information of
>> If the RPCs are in the middle of the transfer, hence defined as in-flight.
>> Or they are just inside your internal client queues and will be sent clearly
>> after the lo_return which surly violates the above. (Without the need of
>> explicit definition of in flight.)
> 
> We are past the transmit state in the RPC FSM for the errors that
> trigger the LAYOUTRETURN.
> 
>>
>>> nor should there
>>> be in the error case. I get a connection error. Did the i/o's I sent
>>> get to the data server?
> 
> If they get to the data server, does the data server use them?! We can
> never know. That is exactly why the client is no longer "using" the
> layout.
> 

That's fine from the objects MDS point of view.  What it needs to know
is whether the DS (OSD) committed the respective I/Os.

>>
>>
>> We should always assume that statistically half was sent and half was not
>> In any way we will send the all "uncertain range" again.
> 
> Sure. We should also let the MDS know that we are resending the
> "uncertain range"  ASAP. Thus the LAYOUTRETURN.
> 
>>
>>> The reason to send a LAYOUTRETURN without
>>> waiting for all the in-flights to return with a connection error is
>>> precisely to fence any in-flight i/o because I'm resending through
>>> the MDS.
>>>
>>
>>
>> This is clearly in contradiction of the RFC.
> 
> I disagree.
> 
>> And is imposing a Server
>> behavior that was not specified in RFC.
> 
> No server behavior is imposed. Just an opportunity for the server to
> do the MUST stated below.
> 
> Section 13.6
> 
>    As described in Section 12.5.1, a client
>    MUST NOT send an I/O to a data server for which it does not hold a
>    valid layout; the data server MUST reject such an I/O.
> 
> 

The DS fencing requirement is for file layout only.

>>
>> All started IO to the specified DS will return with "connection error"
>> pretty fast, right?
> 
> Depends on the timeout.
> 
>> because the first disconnect probably took a timeout, but
>> once the socket identified a disconnect it will stay in that state
>> until a reconnect, right.
>>
>> So what are you attempting to do, Make your internal client Q drain very fast
>> since you are going through MDS?
> 
> If by the internal client Q you mean the DS session slot_tbl_waitq,
> that is a separate issue. Those RPC's are redirected internally upon
> waking from the Q, they never get sent to the DS.
> 
> We do indeed wait for each in-flight RPC to error out before
> re-sending the data of the failed RPC to the MDS.
> 
> Your theory that the LAYOUTRETURN we call will somehow speed up our
> recovery is wrong.
> 
> 

It will for recovering files striped with RAID as the LAYOUTRETURN
provides the server with a reliable "commit point" where it knows
exactly what was written successfully to each stripe and can make
the most efficient decision about recovering it (if needed).

Allowing I/O to the stripe post LAYOUTRETURN may result in data
corruption due to parity inconsistency.

Benny

>> But you are doing that by assuming the
>> Server will fence ALL IO,
> 
> What? No!
> 
>> and not by simply aborting your own Q.
> 
> See above. Of course we abort/redirect our Q.
> 
> We choose not to lose data. We do abort any RPC/NFS/Session queues and
> re-direct. Only the in-flight RPC's which we have no idea of their
> success are resent _after_ getting the error.  The LAYOUTRETURN is an
> indication to the MDS that all is not well.
> 
>> Highly unorthodox
> 
> I'm open to suggestions. :) As I pointed out above, the only reason to
> send the LAYOUTRETURN is to let the MDS know that some I/O might be
> resent. Once the server gets the returned layout, it MUST reject any
> I/O using that layout. (section 13.6).
> 
>> and certainly in violation of above.
> 
> I disagree.
> 
> -->Andy
> 
>>
>>>
>>>>
>>>> I was under the impression that only when the last reference
>>>> on a layout is dropped only then we send the lo_return.
>>>
>>>> If it is not so, this is the proper fix.
>>>
>>>>
>>>> 1. Mark LO invalid so all new IO waits or goes to MDS
>>>> 3. When LO ref drops to Zero send the lo_return.
>>>
>>> Yes for the normal case (evict inode for the file layout),
>>
>>> but not if I'm in an error situation and I want to fence the DS from in-flight i/o.
>>
>> A client has no way stated in the protocol to cause a "fence". This is just your
>> wishful thinking. lo_return is something else, lo_return means I'm no longer using
>> the file, not "please protect me from myself, because I will send more IO after that,
>> but please ignore it"
>>
>> All you can do is abort your own client Q, all these RPCs that did not get sent
>> or errored-out will be resent through MDS, there might be half an RPC of overlap.
>>
>> Think of 4.2 when you will need to report these errors to MDS - on lo_return -
>> Your code will not work.
>>
>> Lets backtrack a second. Let me see if I understand what is your trick:
>> 1. Say we have a file with a files-layout of which device_id specifies 3 DSs.
>> 2. And say I have a large IO generated by an app to that file.
>> 3. In the middle of the IO, one of the DSs, say DS-B, returns with a "connection error"
>> 4 The RPCs stripe_units generated to DS-B will all quickly return with "connection error"
>>  after the first error.
>> 5. But the RPCs to DS-A and DS-C will continue to IO. Until done.
>>
>> But since you will send the entire range of IO through MDS you want that DS-A, DS-C to
>> reject any farther IO, and any RPCs in client Qs for DS-A and DS-C will return quickly
>> with "io rejected".
>>
>> Is that your idea?
>>
>>>
>>>> 4. After LAYOUTRETURN_DONE is back, re-enable layouts.
>>>> I'm so sorry it is not so today. I should have tested for
>>>> this. I admit that all my error injection tests are
>>>> single-file single thread so I did not test for this.
>>>>
>>>> Sigh, work never ends. Tell me if I can help with this
>>>
>>> I'll add the wait/no-wait…
>>>
>>
>>
>> I would not want a sync wait at all. Please don't code any sync wait for me. It will not
>> work for objects, and will produce dead-locks.
>>
>> What I need is that a flag is raised on the lo_seg, and when last ref on the
>> lo_seg drops an lo_return is sent. So either the call to layout_return causes
>> the lo_return, or the last io_done of the inflights will cause the send.
>> (You see io_done is the one that calls layout_return in the first place)
>>
>> !! BUT Please do not do any waits at all for my sake, because this is a sure dead-lock !!
>>
>> And if you ask me It's the way you want it too, Because that's the RFC, and that's
>> what you'll need for 4.2.
>>
>> And perhaps it should not be that hard to implement a Q abort, and not need that
>> unorthodox fencing which is not specified in the RFC. And it might be also possible to only
>> send IO of DS-B through MDS but keep the inflight IO to DS-A and DS-C valid and not resend.
>>
>>> -->Andy
>>>
>>>> Boaz
>>
>>
>> Thanks
>> Boaz
>>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] NFSv4.1 mark layout when already returned
  2012-06-11  9:56           ` Benny Halevy
@ 2012-06-11 10:44             ` Boaz Harrosh
  2012-06-11 14:04               ` Benny Halevy
  2012-06-11 15:08             ` Adamson, Andy
  1 sibling, 1 reply; 26+ messages in thread
From: Boaz Harrosh @ 2012-06-11 10:44 UTC (permalink / raw)
  To: Benny Halevy, Welch, Brent, Bhamare, Sachin
  Cc: Adamson, Andy, Andy Adamson, Myklebust, Trond,
	<linux-nfs@vger.kernel.org>,
	Huang, James

On 06/11/2012 12:56 PM, Benny Halevy wrote:

> 
> It will for recovering files striped with RAID as the LAYOUTRETURN
> provides the server with a reliable "commit point" where it knows
> exactly what was written successfully to each stripe and can make
> the most efficient decision about recovering it (if needed).
> 
> Allowing I/O to the stripe post LAYOUTRETURN may result in data
> corruption due to parity inconsistency.
> 


The fact of the matter is that we - the objects guys - who care about
LAYOUTRETURN the most, have neglected to notice and fix the current
situation, which is even worse than after Andy's patch: it will
send multiple LAYOUTRETURNs on the same layout, and might still
continue IO on returned layouts.

Thanks Andy for testing and highlighting the current problem.
I will prioritize this development internally at Panasas and
will work on a solution ASAP - one that I hope will satisfy
all.

Basically my plan is to postpone sending the LAYOUTRETURN until the
last reference on the segment is dropped, if a flag was set to
do so, and to make sure the reference counting is done properly, so
that once the flag is set the last IO_done will send the LAYOUTRETURN.

I hope that is OK with all?
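
In code terms, something along these lines - again only a sketch, with
hypothetical names rather than the existing pnfs.c symbols, reusing the
stand-in struct my_lseg from my earlier mail:

/* The error path (or an explicit return request) only marks the segment. */
static void mark_lseg_for_return(struct my_lseg *lseg)
{
	set_bit(LSEG_RETURN_ON_LAST_REF, &lseg->flags);
}

/* Every put checks the flag.  Whoever drops the last reference - the caller
 * of layout_return or the last in-flight io_done - is the one that actually
 * sends the LAYOUTRETURN.  No synchronous wait anywhere. */
static void my_put_lseg(struct my_lseg *lseg)
{
	if (!atomic_dec_and_test(&lseg->refcount))
		return;
	if (test_bit(LSEG_RETURN_ON_LAST_REF, &lseg->flags))
		send_layoutreturn_with_errors(lseg, &lseg->err_list);
	free_lseg(lseg);	/* hypothetical, also frees err_list */
}

That way there is no window in which new IO can be started on a segment
that was already returned, and no synchronous wait to dead-lock on.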

> Benny
> 


Thanks
Boaz


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] NFSv4.1 mark layout when already returned
  2012-06-11 10:44             ` Boaz Harrosh
@ 2012-06-11 14:04               ` Benny Halevy
  2012-06-11 14:21                 ` Adamson, Andy
  0 siblings, 1 reply; 26+ messages in thread
From: Benny Halevy @ 2012-06-11 14:04 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Welch, Brent, Bhamare, Sachin, Adamson, Andy, Andy Adamson,
	Myklebust, Trond, <linux-nfs@vger.kernel.org>,
	Huang, James

On 2012-06-11 13:44, Boaz Harrosh wrote:
> On 06/11/2012 12:56 PM, Benny Halevy wrote:
> 
>>
>> It will for recovering files striped with RAID as the LAYOUTRETURN
>> provides the server with a reliable "commit point" where it knows
>> exactly what was written successfully to each stripe and can make
>> the most efficient decision about recovering it (if needed).
>>
>> Allowing I/O to the stripe post LAYOUTRETURN may result in data
>> corruption due to parity inconsistency.
>>
> 
> 
> The fact of the matter is that we - object guys - Which care about
> LAYOUTRETURN the most, have neglected to notice and fix the current
> situation which is even worse than After Andy's patch. Which will
> send multiple LAYOUTRETURNs on the same layout, and might still
> continue IO on returned layouts.
> 
> Thanks Andy for testing and highlighting the current problem,
> I will prioritized this development internally at Panasas and
> will work on a solution ASAP. I hope one that will satisfy
> all.
> 
> Basically my plan is to postpone the LAYOUTRETURN send to
> when the segment is last referenced, if a flag was set to
> do so, and make sure reference counting is done proper, so
> after flag set the last IO_done will send the LAYOUTRETURN.
> 
> If that is OK with all I hope?

Sounds OK to me.  Thanks!

Benny

> 
>> Benny
>>
> 
> 
> Thanks
> Boaz
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] NFSv4.1 mark layout when already returned
  2012-06-11 14:04               ` Benny Halevy
@ 2012-06-11 14:21                 ` Adamson, Andy
  2012-06-11 14:51                   ` Boaz Harrosh
  0 siblings, 1 reply; 26+ messages in thread
From: Adamson, Andy @ 2012-06-11 14:21 UTC (permalink / raw)
  To: Benny Halevy
  Cc: Boaz Harrosh, Welch, Brent, Bhamare, Sachin, Adamson, Andy,
	Andy Adamson, Myklebust, Trond, <linux-nfs@vger.kernel.org>,
	Huang, James


On Jun 11, 2012, at 10:04 AM, Benny Halevy wrote:

> On 2012-06-11 13:44, Boaz Harrosh wrote:
>> On 06/11/2012 12:56 PM, Benny Halevy wrote:
>> 
>>> 
>>> It will for recovering files striped with RAID as the LAYOUTRETURN
>>> provides the server with a reliable "commit point" where it knows
>>> exactly what was written successfully to each stripe and can make
>>> the most efficient decision about recovering it (if needed).
>>> 
>>> Allowing I/O to the stripe post LAYOUTRETURN may result in data
>>> corruption due to parity inconsistency.
>>> 
>> 
>> 
>> The fact of the matter is that we - object guys - Which care about
>> LAYOUTRETURN the most, have neglected to notice and fix the current
>> situation which is even worse than After Andy's patch. Which will
>> send multiple LAYOUTRETURNs on the same layout, and might still
>> continue IO on returned layouts.
>> 
>> Thanks Andy for testing and highlighting the current problem,
>> I will prioritized this development internally at Panasas and
>> will work on a solution ASAP. I hope one that will satisfy
>> all.
>> 
>> Basically my plan is to postpone the LAYOUTRETURN send to
>> when the segment is last referenced, if a flag was set to
>> do so, and make sure reference counting is done proper, so
>> after flag set the last IO_done will send the LAYOUTRETURN.
>> 
>> If that is OK with all I hope?
> 
> Sounds OK to me.  Thanks!

Sounds fine to me as well, for LAYOUTRETURN during normal behavior.

-->Andy

> 
> Benny
> 
>> 
>>> Benny
>>> 
>> 
>> 
>> Thanks
>> Boaz
>> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] NFSv4.1 mark layout when already returned
  2012-06-11 14:21                 ` Adamson, Andy
@ 2012-06-11 14:51                   ` Boaz Harrosh
  2012-06-11 15:41                     ` Adamson, Andy
  0 siblings, 1 reply; 26+ messages in thread
From: Boaz Harrosh @ 2012-06-11 14:51 UTC (permalink / raw)
  To: Adamson, Andy
  Cc: Benny Halevy, Welch, Brent, Bhamare, Sachin, Andy Adamson,
	Myklebust, Trond, <linux-nfs@vger.kernel.org>,
	Huang, James

On 06/11/2012 05:21 PM, Adamson, Andy wrote:

> 
> On Jun 11, 2012, at 10:04 AM, Benny Halevy wrote:
> 
>> On 2012-06-11 13:44, Boaz Harrosh wrote:
>>> Basically my plan is to postpone the LAYOUTRETURN send to
>>> when the segment is last referenced, if a flag was set to
>>> do so, and make sure reference counting is done proper, so
>>> after flag set the last IO_done will send the LAYOUTRETURN.
>>>
>>> If that is OK with all I hope?
>>
>> Sounds OK to me.  Thanks!
> 
> Sounds fine to me as well, for LAYOUTRETURN during normal behavior.
> 
> -->Andy


I'm not sure I understood what you meant.

In the "normal behavior" we don't send a LAYOUTRETURN at all
in the forgetful model. Which proved beneficiary in the light
of concurrent GETs vs RETURNs.

I thought we were only talking about the error case. Do you
mean there is a 3rd case? Please explain.

Thanks
Boaz

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] NFSv4.1 mark layout when already returned
  2012-06-11  9:56           ` Benny Halevy
  2012-06-11 10:44             ` Boaz Harrosh
@ 2012-06-11 15:08             ` Adamson, Andy
  2012-06-11 15:38               ` Benny Halevy
  2012-06-11 15:59               ` Boaz Harrosh
  1 sibling, 2 replies; 26+ messages in thread
From: Adamson, Andy @ 2012-06-11 15:08 UTC (permalink / raw)
  To: Benny Halevy
  Cc: Adamson, Andy, Andy Adamson, Boaz Harrosh, Myklebust, Trond,
	<linux-nfs@vger.kernel.org>


On Jun 11, 2012, at 5:56 AM, Benny Halevy wrote:

> On 2012-06-05 22:22, Andy Adamson wrote:
>> On Tue, Jun 5, 2012 at 10:54 AM, Boaz Harrosh <bharrosh@panasas.com> wrote:
>>> On 06/05/2012 04:36 PM, Adamson, Andy wrote:
>>> 
>>>> On Jun 2, 2012, at 6:51 PM, Boaz Harrosh wrote:
>>>>> 
>>>>> In objects-layout we must report all errors on layout_return. We
>>>>> accumulate them and report of all errors at once. So we need
>>>>> the return after all in flights are back. (And no new IOs are
>>>>> sent) Otherwise we might miss some.
>>>> 
>>> 
>>>> _pnfs_retrun_layout removes all layouts, and should therefore only be
>>>> called once.
>>> 
>>> 
>>> I agree current behavior is a bug, hence my apology.
>>> 
>>>> I'll can add the 'wait for all in-flight' functionality,
>>>> and we can switch behaviors (wait or not wait).
>>>> 
>>> 
>>> 
>>> I disagree you must "delay" the send see below.
>>> 
>>>>> Also the RFC mandates that we do not use any layout or have
>>>>> IOs in flight, once we return the layout.
>>>> 
>>>> 
>>>> You are referring to this piece of the spec?
>>>> Section 18.44.3
>>>> 
>>>>  After this call,
>>>>   the client MUST NOT use the returned layout(s) and the associated
>>>>   storage protocol to access the file data.
>>>> 
>>>> 
>>> 
>>>> The above says that the client MUST NOT send any _new_ i/o using the
>>>> layout. I don't see any reference to in-flight i/o,
> 
> Our assumption when designing the objects layout, in particular with regards to
> client-based RAID was that LAYOUTRETURN quiesces in flight I/Os so that other
> clients or the MDS see a consistent parity-stripe state.

I'm not suggesting any LAYOUTRETURN behavior for objects.

I'm coding a LAYOUTRETURN to allow the MDS to fence a file layout data server. 

> 
>>> 
>>> 
>>> I don't see reference to in-flight i/o either, so what does that mean?
>>> Yes or No? It does not it's vague. For me in-flight means "USING"
>> 
>> I shot an arrow in the air - where it lands I know not where. Am I
>> still using the arrow after I shoot?  If so, exaclty when do I know
>> that it is not in use by me?
> 
> You need to wait for in-flight I/Os to either succeed, fail (e.g. time out),
> or be aborted.  The non-successful cases are going to be reported by the
> objects layout driver so the MDS can recover from these errors.

The object layout driver may have this requirement, but the file layout driver does not.

> 
>> 
>> If my RPC's time out,
>> is the DS using the layout? Suppose it's not a network partition, or a
>> DS reboot, but just a DS under really heavy load (or for some other
>> reason) and does not reply within the timeout? Is the client still
>> using the layout?
>> 
>> So I wait for the "answer", get timeouts, and I have no more
>> information than if I didn't wait for the answer.  Regardless, once
>> the decision is made on the client to not send any more i/o using that
>> data server, I should let the MDS know.
>> 
>> 
> 
> Unfortunately that is too weak for client-based RAID.
> 
>>> 
>>> Because in-flight means half was written/read and half was not, if the
>>> lo_return was received in the middle then the half that came after was
>>> using the layout information after the Server received an lo_return,
>>> which clearly violates the above.
>> 
>> No it doesn't. That just means the DS is using the layout. The client
>> is done using the layout until it sends new i/o using the layout.
>> 
>>> 
>>> In any way, at this point in the code you do not have the information of
>>> If the RPCs are in the middle of the transfer, hence defined as in-flight.
>>> Or they are just inside your internal client queues and will be sent clearly
>>> after the lo_return which surly violates the above. (Without the need of
>>> explicit definition of in flight.)
>> 
>> We are past the transmit state in the RPC FSM for the errors that
>> trigger the LAYOUTRETURN.
>> 
>>> 
>>>> nor should there
>>>> be in the error case. I get a connection error. Did the i/o's I sent
>>>> get to the data server?
>> 
>> If they get to the data server, does the data server use them?! We can
>> never know. That is exactly why the client is no longer "using" the
>> layout.
>> 
> 
> That's fine from the objects MDS point of view.  What it needs to know
> is whether the DS (OSD) committed the respective I/Os.
> 
>>> 
>>> 
>>> We should always assume that statistically half was sent and half was not
>>> In any way we will send the all "uncertain range" again.
>> 
>> Sure. We should also let the MDS know that we are resending the
>> "uncertain range"  ASAP. Thus the LAYOUTRETURN.
>> 
>>> 
>>>> The reason to send a LAYOUTRETURN without
>>>> waiting for all the in-flights to return with a connection error is
>>>> precisely to fence any in-flight i/o because I'm resending through
>>>> the MDS.
>>>> 
>>> 
>>> 
>>> This is clearly in contradiction of the RFC.
>> 
>> I disagree.
>> 
>>> And is imposing a Server
>>> behavior that was not specified in RFC.
>> 
>> No server behavior is imposed. Just an opportunity for the server to
>> do the MUST stated below.
>> 
>> Section 13.6
>> 
>>   As described in Section 12.5.1, a client
>>   MUST NOT send an I/O to a data server for which it does not hold a
>>   valid layout; the data server MUST reject such an I/O.
>> 
>> 
> 
> The DS fencing requirement is for file layout only.

Which is what I am coding - a file layout fence using LAYOUTRETURN.

> 
>>> 
>>> All started IO to the specified DS will return with "connection error"
>>> pretty fast, right?
>> 
>> Depends on the timeout.
>> 
>>> because the first disconnect probably took a timeout, but
>>> once the socket identified a disconnect it will stay in that state
>>> until a reconnect, right.
>>> 
>>> So what are you attempting to do, Make your internal client Q drain very fast
>>> since you are going through MDS?
>> 
>> If by the internal client Q you mean the DS session slot_tbl_waitq,
>> that is a separate issue. Those RPC's are redirected internally upon
>> waking from the Q, they never get sent to the DS.
>> 
>> We do indeed wait for each in-flight RPC to error out before
>> re-sending the data of the failed RPC to the MDS.
>> 
>> Your theory that the LAYOUTRETURN we call will somehow speed up our
>> recovery is wrong.
>> 
>> 
> 
> It will for recovering files striped with RAID as the LAYOUTRETURN
> provides the server with a reliable "commit point" where it knows
> exactly what was written successfully to each stripe and can make
> the most efficient decision about recovering it (if needed).
> 
> Allowing I/O to the stripe post LAYOUTRETURN may result in data
> corruption due to parity inconsistency.

For objects.

-->Andy

> 
> Benny
> 
>>> But you are doing that by assuming the
>>> Server will fence ALL IO,
>> 
>> What? No!
>> 
>>> and not by simply aborting your own Q.
>> 
>> See above. Of course we abort/redirect our Q.
>> 
>> We choose not to lose data. We do abort any RPC/NFS/Session queues and
>> re-direct. Only the in-flight RPC's which we have no idea of their
>> success are resent _after_ getting the error.  The LAYOUTRETURN is an
>> indication to the MDS that all is not well.
>> 
>>> Highly unorthodox
>> 
>> I'm open to suggestions. :) As I pointed out above, the only reason to
>> send the LAYOUTRETURN is to let the MDS know that some I/O might be
>> resent. Once the server gets the returned layout, it MUST reject any
>> I/O using that layout. (section 13.6).
>> 
>>> and certainly in violation of above.
>> 
>> I disagree.
>> 
>> -->Andy
>> 
>>> 
>>>> 
>>>>> 
>>>>> I was under the impression that only when the last reference
>>>>> on a layout is dropped only then we send the lo_return.
>>>> 
>>>>> If it is not so, this is the proper fix.
>>>> 
>>>>> 
>>>>> 1. Mark LO invalid so all new IO waits or goes to MDS
>>>>> 3. When LO ref drops to Zero send the lo_return.
>>>> 
>>>> Yes for the normal case (evict inode for the file layout),
>>> 
>>>> but not if I'm in an error situation and I want to fence the DS from in-flight i/o.
>>> 
>>> A client has no way stated in the protocol to cause a "fence". This is just your
>>> wishful thinking. lo_return is something else, lo_return means I'm no longer using
>>> the file, not "please protect me from myself, because I will send more IO after that,
>>> but please ignore it"
>>> 
>>> All you can do is abort your own client Q, all these RPCs that did not get sent
>>> or errored-out will be resent through MDS, there might be half an RPC of overlap.
>>> 
>>> Think of 4.2 when you will need to report these errors to MDS - on lo_return -
>>> Your code will not work.
>>> 
>>> Lets backtrack a second. Let me see if I understand what is your trick:
>>> 1. Say we have a file with a files-layout of which device_id specifies 3 DSs.
>>> 2. And say I have a large IO generated by an app to that file.
>>> 3. In the middle of the IO, one of the DSs, say DS-B, returns with a "connection error"
>>> 4 The RPCs stripe_units generated to DS-B will all quickly return with "connection error"
>>> after the first error.
>>> 5. But the RPCs to DS-A and DS-C will continue to IO. Until done.
>>> 
>>> But since you will send the entire range of IO through MDS you want that DS-A, DS-C to
>>> reject any farther IO, and any RPCs in client Qs for DS-A and DS-C will return quickly
>>> with "io rejected".
>>> 
>>> Is that your idea?
>>> 
>>>> 
>>>>> 4. After LAYOUTRETURN_DONE is back, re-enable layouts.
>>>>> I'm so sorry it is not so today. I should have tested for
>>>>> this. I admit that all my error injection tests are
>>>>> single-file single thread so I did not test for this.
>>>>> 
>>>>> Sigh, work never ends. Tell me if I can help with this
>>>> 
>>>> I'll add the wait/no-wait…
>>>> 
>>> 
>>> 
>>> I would not want a sync wait at all. Please don't code any sync wait for me. It will not
>>> work for objects, and will produce dead-locks.
>>> 
>>> What I need is that a flag is raised on the lo_seg, and when last ref on the
>>> lo_seg drops an lo_return is sent. So either the call to layout_return causes
>>> the lo_return, or the last io_done of the inflights will cause the send.
>>> (You see io_done is the one that calls layout_return in the first place)
>>> 
>>> !! BUT Please do not do any waits at all for my sake, because this is a sure dead-lock !!
>>> 
>>> And if you ask me It's the way you want it too, Because that's the RFC, and that's
>>> what you'll need for 4.2.
>>> 
>>> And perhaps it should not be that hard to implement a Q abort, and not need that
>>> unorthodox fencing which is not specified in the RFC. And it might be also possible to only
>>> send IO of DS-B through MDS but keep the inflight IO to DS-A and DS-C valid and not resend.
>>> 
>>>> -->Andy
>>>> 
>>>>> Boaz
>>> 
>>> 
>>> Thanks
>>> Boaz
>>> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] NFSv4.1 mark layout when already returned
  2012-06-11 15:08             ` Adamson, Andy
@ 2012-06-11 15:38               ` Benny Halevy
  2012-06-11 15:52                 ` Adamson, Andy
  2012-06-11 15:59               ` Boaz Harrosh
  1 sibling, 1 reply; 26+ messages in thread
From: Benny Halevy @ 2012-06-11 15:38 UTC (permalink / raw)
  To: Adamson, Andy
  Cc: Andy Adamson, Boaz Harrosh, Myklebust, Trond,
	<linux-nfs@vger.kernel.org>

On 2012-06-11 18:08, Adamson, Andy wrote:
> 
> On Jun 11, 2012, at 5:56 AM, Benny Halevy wrote:
> 
>> On 2012-06-05 22:22, Andy Adamson wrote:
>>> On Tue, Jun 5, 2012 at 10:54 AM, Boaz Harrosh <bharrosh@panasas.com> wrote:
>>>> On 06/05/2012 04:36 PM, Adamson, Andy wrote:
>>>>
>>>>> On Jun 2, 2012, at 6:51 PM, Boaz Harrosh wrote:
>>>>>>
>>>>>> In objects-layout we must report all errors on layout_return. We
>>>>>> accumulate them and report of all errors at once. So we need
>>>>>> the return after all in flights are back. (And no new IOs are
>>>>>> sent) Otherwise we might miss some.
>>>>>
>>>>
>>>>> _pnfs_retrun_layout removes all layouts, and should therefore only be
>>>>> called once.
>>>>
>>>>
>>>> I agree current behavior is a bug, hence my apology.
>>>>
>>>>> I'll can add the 'wait for all in-flight' functionality,
>>>>> and we can switch behaviors (wait or not wait).
>>>>>
>>>>
>>>>
>>>> I disagree you must "delay" the send see below.
>>>>
>>>>>> Also the RFC mandates that we do not use any layout or have
>>>>>> IOs in flight, once we return the layout.
>>>>>
>>>>>
>>>>> You are referring to this piece of the spec?
>>>>> Section 18.44.3
>>>>>
>>>>>  After this call,
>>>>>   the client MUST NOT use the returned layout(s) and the associated
>>>>>   storage protocol to access the file data.
>>>>>
>>>>>
>>>>
>>>>> The above says that the client MUST NOT send any _new_ i/o using the
>>>>> layout. I don't see any reference to in-flight i/o,
>>
>> Our assumption when designing the objects layout, in particular with regards to
>> client-based RAID was that LAYOUTRETURN quiesces in flight I/Os so that other
>> clients or the MDS see a consistent parity-stripe state.
> 
> I'm not suggesting any LAYOUTRETURN behavior for objects.
> 
> I'm coding a LAYOUTRETURN to allow the MDS to fence a file layout data server. 
> 

Yeah, right now pnfs_return_layout always returns the whole layout,
so this patch works.  Boaz's point (I think) was that marking the whole
layout as returned in the generic layer will prevent the objects layout
from returning individual layout segments.

Benny

>>
>>>>
>>>>
>>>> I don't see reference to in-flight i/o either, so what does that mean?
>>>> Yes or No? It does not it's vague. For me in-flight means "USING"
>>>
>>> I shot an arrow in the air - where it lands I know not where. Am I
>>> still using the arrow after I shoot?  If so, exaclty when do I know
>>> that it is not in use by me?
>>
>> You need to wait for in-flight I/Os to either succeed, fail (e.g. time out),
>> or be aborted.  The non-successful cases are going to be reported by the
>> objects layout driver so the MDS can recover from these errors.
> 
> The object layout driver may have this requirement, but the file layout driver does not.
> 
>>
>>>
>>> If my RPC's time out,
>>> is the DS using the layout? Suppose it's not a network partition, or a
>>> DS reboot, but just a DS under really heavy load (or for some other
>>> reason) and does not reply within the timeout? Is the client still
>>> using the layout?
>>>
>>> So I wait for the "answer", get timeouts, and I have no more
>>> information than if I didn't wait for the answer.  Regardless, once
>>> the decision is made on the client to not send any more i/o using that
>>> data server, I should let the MDS know.
>>>
>>>
>>
>> Unfortunately that is too weak for client-based RAID.
>>
>>>>
>>>> Because in-flight means half was written/read and half was not, if the
>>>> lo_return was received in the middle then the half that came after was
>>>> using the layout information after the Server received an lo_return,
>>>> which clearly violates the above.
>>>
>>> No it doesn't. That just means the DS is using the layout. The client
>>> is done using the layout until it sends new i/o using the layout.
>>>
>>>>
>>>> In any way, at this point in the code you do not have the information of
>>>> If the RPCs are in the middle of the transfer, hence defined as in-flight.
>>>> Or they are just inside your internal client queues and will be sent clearly
>>>> after the lo_return which surly violates the above. (Without the need of
>>>> explicit definition of in flight.)
>>>
>>> We are past the transmit state in the RPC FSM for the errors that
>>> trigger the LAYOUTRETURN.
>>>
>>>>
>>>>> nor should there
>>>>> be in the error case. I get a connection error. Did the i/o's I sent
>>>>> get to the data server?
>>>
>>> If they get to the data server, does the data server use them?! We can
>>> never know. That is exactly why the client is no longer "using" the
>>> layout.
>>>
>>
>> That's fine from the objects MDS point of view.  What it needs to know
>> is whether the DS (OSD) committed the respective I/Os.
>>
>>>>
>>>>
>>>> We should always assume that statistically half was sent and half was not
>>>> In any way we will send the all "uncertain range" again.
>>>
>>> Sure. We should also let the MDS know that we are resending the
>>> "uncertain range"  ASAP. Thus the LAYOUTRETURN.
>>>
>>>>
>>>>> The reason to send a LAYOUTRETURN without
>>>>> waiting for all the in-flights to return with a connection error is
>>>>> precisely to fence any in-flight i/o because I'm resending through
>>>>> the MDS.
>>>>>
>>>>
>>>>
>>>> This is clearly in contradiction of the RFC.
>>>
>>> I disagree.
>>>
>>>> And is imposing a Server
>>>> behavior that was not specified in RFC.
>>>
>>> No server behavior is imposed. Just an opportunity for the server to
>>> do the MUST stated below.
>>>
>>> Section 13.6
>>>
>>>   As described in Section 12.5.1, a client
>>>   MUST NOT send an I/O to a data server for which it does not hold a
>>>   valid layout; the data server MUST reject such an I/O.
>>>
>>>
>>
>> The DS fencing requirement is for file layout only.
> 
> Which is what I am coding - a file layout fence using LAYOUTRETURN.
> 
>>
>>>>
>>>> All started IO to the specified DS will return with "connection error"
>>>> pretty fast, right?
>>>
>>> Depends on the timeout.
>>>
>>>> because the first disconnect probably took a timeout, but
>>>> once the socket identified a disconnect it will stay in that state
>>>> until a reconnect, right.
>>>>
>>>> So what are you attempting to do, Make your internal client Q drain very fast
>>>> since you are going through MDS?
>>>
>>> If by the internal client Q you mean the DS session slot_tbl_waitq,
>>> that is a separate issue. Those RPC's are redirected internally upon
>>> waking from the Q, they never get sent to the DS.
>>>
>>> We do indeed wait for each in-flight RPC to error out before
>>> re-sending the data of the failed RPC to the MDS.
>>>
>>> Your theory that the LAYOUTRETURN we call will somehow speed up our
>>> recovery is wrong.
>>>
>>>
>>
>> It will for recovering files striped with RAID as the LAYOUTRETURN
>> provides the server with a reliable "commit point" where it knows
>> exactly what was written successfully to each stripe and can make
>> the most efficient decision about recovering it (if needed).
>>
>> Allowing I/O to the stripe post LAYOUTRETURN may result in data
>> corruption due to parity inconsistency.
> 
> For objects.
> 
> -->Andy
> 
>>
>> Benny
>>
>>>> But you are doing that by assuming the
>>>> Server will fence ALL IO,
>>>
>>> What? No!
>>>
>>>> and not by simply aborting your own Q.
>>>
>>> See above. Of course we abort/redirect our Q.
>>>
>>> We choose not to lose data. We do abort any RPC/NFS/Session queues and
>>> re-direct. Only the in-flight RPC's which we have no idea of their
>>> success are resent _after_ getting the error.  The LAYOUTRETURN is an
>>> indication to the MDS that all is not well.
>>>
>>>> Highly unorthodox
>>>
>>> I'm open to suggestions. :) As I pointed out above, the only reason to
>>> send the LAYOUTRETURN is to let the MDS know that some I/O might be
>>> resent. Once the server gets the returned layout, it MUST reject any
>>> I/O using that layout. (section 13.6).
>>>
>>>> and certainly in violation of above.
>>>
>>> I disagree.
>>>
>>> -->Andy
>>>
>>>>
>>>>>
>>>>>>
>>>>>> I was under the impression that only when the last reference
>>>>>> on a layout is dropped only then we send the lo_return.
>>>>>
>>>>>> If it is not so, this is the proper fix.
>>>>>
>>>>>>
>>>>>> 1. Mark LO invalid so all new IO waits or goes to MDS
>>>>>> 3. When LO ref drops to Zero send the lo_return.
>>>>>
>>>>> Yes for the normal case (evict inode for the file layout),
>>>>
>>>>> but not if I'm in an error situation and I want to fence the DS from in-flight i/o.
>>>>
>>>> A client has no way stated in the protocol to cause a "fence". This is just your
>>>> wishful thinking. lo_return is something else, lo_return means I'm no longer using
>>>> the file, not "please protect me from myself, because I will send more IO after that,
>>>> but please ignore it"
>>>>
>>>> All you can do is abort your own client Q, all these RPCs that did not get sent
>>>> or errored-out will be resent through MDS, there might be half an RPC of overlap.
>>>>
>>>> Think of 4.2 when you will need to report these errors to MDS - on lo_return -
>>>> Your code will not work.
>>>>
>>>> Lets backtrack a second. Let me see if I understand what is your trick:
>>>> 1. Say we have a file with a files-layout of which device_id specifies 3 DSs.
>>>> 2. And say I have a large IO generated by an app to that file.
>>>> 3. In the middle of the IO, one of the DSs, say DS-B, returns with a "connection error"
>>>> 4 The RPCs stripe_units generated to DS-B will all quickly return with "connection error"
>>>> after the first error.
>>>> 5. But the RPCs to DS-A and DS-C will continue to IO. Until done.
>>>>
>>>> But since you will send the entire range of IO through MDS you want that DS-A, DS-C to
>>>> reject any farther IO, and any RPCs in client Qs for DS-A and DS-C will return quickly
>>>> with "io rejected".
>>>>
>>>> Is that your idea?
>>>>
>>>>>
>>>>>> 4. After LAYOUTRETURN_DONE is back, re-enable layouts.
>>>>>> I'm so sorry it is not so today. I should have tested for
>>>>>> this. I admit that all my error injection tests are
>>>>>> single-file single thread so I did not test for this.
>>>>>>
>>>>>> Sigh, work never ends. Tell me if I can help with this
>>>>>
>>>>> I'll add the wait/no-wait…
>>>>>
>>>>
>>>>
>>>> I would not want a sync wait at all. Please don't code any sync wait for me. It will not
>>>> work for objects, and will produce dead-locks.
>>>>
>>>> What I need is that a flag is raised on the lo_seg, and when last ref on the
>>>> lo_seg drops an lo_return is sent. So either the call to layout_return causes
>>>> the lo_return, or the last io_done of the inflights will cause the send.
>>>> (You see io_done is the one that calls layout_return in the first place)
>>>>
>>>> !! BUT Please do not do any waits at all for my sake, because this is a sure dead-lock !!
>>>>
>>>> And if you ask me It's the way you want it too, Because that's the RFC, and that's
>>>> what you'll need for 4.2.
>>>>
>>>> And perhaps it should not be that hard to implement a Q abort, and not need that
>>>> unorthodox fencing which is not specified in the RFC. And it might be also possible to only
>>>> send IO of DS-B through MDS but keep the inflight IO to DS-A and DS-C valid and not resend.
>>>>
>>>>> -->Andy
>>>>>
>>>>>> Boaz
>>>>
>>>>
>>>> Thanks
>>>> Boaz
>>>>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] NFSv4.1 mark layout when already returned
  2012-06-11 14:51                   ` Boaz Harrosh
@ 2012-06-11 15:41                     ` Adamson, Andy
  2012-06-11 16:14                       ` Boaz Harrosh
  0 siblings, 1 reply; 26+ messages in thread
From: Adamson, Andy @ 2012-06-11 15:41 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Adamson, Andy, Benny Halevy, Welch, Brent, Bhamare, Sachin,
	Andy Adamson, Myklebust, Trond, <linux-nfs@vger.kernel.org>,
	Huang, James


On Jun 11, 2012, at 10:51 AM, Boaz Harrosh wrote:

> On 06/11/2012 05:21 PM, Adamson, Andy wrote:
> 
>> 
>> On Jun 11, 2012, at 10:04 AM, Benny Halevy wrote:
>> 
>>> On 2012-06-11 13:44, Boaz Harrosh wrote:
>>>> Basically my plan is to postpone the LAYOUTRETURN send to
>>>> when the segment is last referenced, if a flag was set to
>>>> do so, and make sure reference counting is done proper, so
>>>> after flag set the last IO_done will send the LAYOUTRETURN.
>>>> 
>>>> If that is OK with all I hope?
>>> 
>>> Sounds OK to me.  Thanks!
>> 
>> Sounds fine to me as well, for LAYOUTRETURN during normal behavior.
>> 
>> -->Andy
> 
> 
> I'm not sure I understood what you meant.
> 
> In the "normal behavior" we don't send a LAYOUTRETURN at all
> in the forgetful model. Which proved beneficiary in the light
> of concurrent GETs vs RETURNs.
> 
> I thought we are only taking about the error case. Do you
> mean there is a 3rd case? please explain.


The normal case of calling LAYOUTRETURN on evict inode. 

-->Andy

> 
> Thanks
> Boaz


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] NFSv4.1 mark layout when already returned
  2012-06-11 15:38               ` Benny Halevy
@ 2012-06-11 15:52                 ` Adamson, Andy
  2012-06-11 16:07                   ` Boaz Harrosh
  0 siblings, 1 reply; 26+ messages in thread
From: Adamson, Andy @ 2012-06-11 15:52 UTC (permalink / raw)
  To: Benny Halevy
  Cc: Adamson, Andy, Andy Adamson, Boaz Harrosh, Myklebust, Trond,
	<linux-nfs@vger.kernel.org>


On Jun 11, 2012, at 11:38 AM, Benny Halevy wrote:

> On 2012-06-11 18:08, Adamson, Andy wrote:
>> 
>> On Jun 11, 2012, at 5:56 AM, Benny Halevy wrote:
>> 
>>> On 2012-06-05 22:22, Andy Adamson wrote:
>>>> On Tue, Jun 5, 2012 at 10:54 AM, Boaz Harrosh <bharrosh@panasas.com> wrote:
>>>>> On 06/05/2012 04:36 PM, Adamson, Andy wrote:
>>>>> 
>>>>>> On Jun 2, 2012, at 6:51 PM, Boaz Harrosh wrote:
>>>>>>> 
>>>>>>> In objects-layout we must report all errors on layout_return. We
>>>>>>> accumulate them and report of all errors at once. So we need
>>>>>>> the return after all in flights are back. (And no new IOs are
>>>>>>> sent) Otherwise we might miss some.
>>>>>> 
>>>>> 
>>>>>> _pnfs_retrun_layout removes all layouts, and should therefore only be
>>>>>> called once.
>>>>> 
>>>>> 
>>>>> I agree current behavior is a bug, hence my apology.
>>>>> 
>>>>>> I'll can add the 'wait for all in-flight' functionality,
>>>>>> and we can switch behaviors (wait or not wait).
>>>>>> 
>>>>> 
>>>>> 
>>>>> I disagree you must "delay" the send see below.
>>>>> 
>>>>>>> Also the RFC mandates that we do not use any layout or have
>>>>>>> IOs in flight, once we return the layout.
>>>>>> 
>>>>>> 
>>>>>> You are referring to this piece of the spec?
>>>>>> Section 18.44.3
>>>>>> 
>>>>>> After this call,
>>>>>>  the client MUST NOT use the returned layout(s) and the associated
>>>>>>  storage protocol to access the file data.
>>>>>> 
>>>>>> 
>>>>> 
>>>>>> The above says that the client MUST NOT send any _new_ i/o using the
>>>>>> layout. I don't see any reference to in-flight i/o,
>>> 
>>> Our assumption when designing the objects layout, in particular with regards to
>>> client-based RAID was that LAYOUTRETURN quiesces in flight I/Os so that other
>>> clients or the MDS see a consistent parity-stripe state.
>> 
>> I'm not suggesting any LAYOUTRETURN behavior for objects.
>> 
>> I'm coding a LAYOUTRETURN to allow the MDS to fence a file layout data server. 
>> 
> 
> Yeah, right now pnfs_return_layout always returns the whole layout
> so this patch works.  Boaz's point (I think) was that marking the whole layout
> as returned in the generic layer will prevent returning of layout segments
> by the objects layout.

Yep. I agree that _pnfs_return_layout needs to operate on an input range, and
should wait for all lseg references to drop - the "normal" behavior.

Then, after that is done,
I will write a patch to allow the file layout driver to call _pnfs_return_layout in
"fence mode", i.e. returning the whole layout and not waiting for the last lseg reference.

-->Andy
 
> 
> Benny
> 
>>> 
>>>>> 
>>>>> 
>>>>> I don't see reference to in-flight i/o either, so what does that mean?
>>>>> Yes or No? It does not it's vague. For me in-flight means "USING"
>>>> 
>>>> I shot an arrow in the air - where it lands I know not where. Am I
>>>> still using the arrow after I shoot?  If so, exaclty when do I know
>>>> that it is not in use by me?
>>> 
>>> You need to wait for in-flight I/Os to either succeed, fail (e.g. time out),
>>> or be aborted.  The non-successful cases are going to be reported by the
>>> objects layout driver so the MDS can recover from these errors.
>> 
>> The object layout driver may have this requirement, but the file layout driver does not.
>> 
>>> 
>>>> 
>>>> If my RPC's time out,
>>>> is the DS using the layout? Suppose it's not a network partition, or a
>>>> DS reboot, but just a DS under really heavy load (or for some other
>>>> reason) and does not reply within the timeout? Is the client still
>>>> using the layout?
>>>> 
>>>> So I wait for the "answer", get timeouts, and I have no more
>>>> information than if I didn't wait for the answer.  Regardless, once
>>>> the decision is made on the client to not send any more i/o using that
>>>> data server, I should let the MDS know.
>>>> 
>>>> 
>>> 
>>> Unfortunately that is too weak for client-based RAID.
>>> 
>>>>> 
>>>>> Because in-flight means half was written/read and half was not, if the
>>>>> lo_return was received in the middle then the half that came after was
>>>>> using the layout information after the Server received an lo_return,
>>>>> which clearly violates the above.
>>>> 
>>>> No it doesn't. That just means the DS is using the layout. The client
>>>> is done using the layout until it sends new i/o using the layout.
>>>> 
>>>>> 
>>>>> In any way, at this point in the code you do not have the information of
>>>>> If the RPCs are in the middle of the transfer, hence defined as in-flight.
>>>>> Or they are just inside your internal client queues and will be sent clearly
>>>>> after the lo_return which surly violates the above. (Without the need of
>>>>> explicit definition of in flight.)
>>>> 
>>>> We are past the transmit state in the RPC FSM for the errors that
>>>> trigger the LAYOUTRETURN.
>>>> 
>>>>> 
>>>>>> nor should there
>>>>>> be in the error case. I get a connection error. Did the i/o's I sent
>>>>>> get to the data server?
>>>> 
>>>> If they get to the data server, does the data server use them?! We can
>>>> never know. That is exactly why the client is no longer "using" the
>>>> layout.
>>>> 
>>> 
>>> That's fine from the objects MDS point of view.  What it needs to know
>>> is whether the DS (OSD) committed the respective I/Os.
>>> 
>>>>> 
>>>>> 
>>>>> We should always assume that statistically half was sent and half was not
>>>>> In any way we will send the all "uncertain range" again.
>>>> 
>>>> Sure. We should also let the MDS know that we are resending the
>>>> "uncertain range"  ASAP. Thus the LAYOUTRETURN.
>>>> 
>>>>> 
>>>>>> The reason to send a LAYOUTRETURN without
>>>>>> waiting for all the in-flights to return with a connection error is
>>>>>> precisely to fence any in-flight i/o because I'm resending through
>>>>>> the MDS.
>>>>>> 
>>>>> 
>>>>> 
>>>>> This is clearly in contradiction of the RFC.
>>>> 
>>>> I disagree.
>>>> 
>>>>> And is imposing a Server
>>>>> behavior that was not specified in RFC.
>>>> 
>>>> No server behavior is imposed. Just an opportunity for the server to
>>>> do the MUST stated below.
>>>> 
>>>> Section 13.6
>>>> 
>>>>  As described in Section 12.5.1, a client
>>>>  MUST NOT send an I/O to a data server for which it does not hold a
>>>>  valid layout; the data server MUST reject such an I/O.
>>>> 
>>>> 
>>> 
>>> The DS fencing requirement is for file layout only.
>> 
>> Which is what I am coding - a file layout fence using LAYOUTRETURN.
>> 
>>> 
>>>>> 
>>>>> All started IO to the specified DS will return with "connection error"
>>>>> pretty fast, right?
>>>> 
>>>> Depends on the timeout.
>>>> 
>>>>> because the first disconnect probably took a timeout, but
>>>>> once the socket identified a disconnect it will stay in that state
>>>>> until a reconnect, right.
>>>>> 
>>>>> So what are you attempting to do, Make your internal client Q drain very fast
>>>>> since you are going through MDS?
>>>> 
>>>> If by the internal client Q you mean the DS session slot_tbl_waitq,
>>>> that is a separate issue. Those RPC's are redirected internally upon
>>>> waking from the Q, they never get sent to the DS.
>>>> 
>>>> We do indeed wait for each in-flight RPC to error out before
>>>> re-sending the data of the failed RPC to the MDS.
>>>> 
>>>> Your theory that the LAYOUTRETURN we call will somehow speed up our
>>>> recovery is wrong.
>>>> 
>>>> 
>>> 
>>> It will for recovering files striped with RAID as the LAYOUTRETURN
>>> provides the server with a reliable "commit point" where it knows
>>> exactly what was written successfully to each stripe and can make
>>> the most efficient decision about recovering it (if needed).
>>> 
>>> Allowing I/O to the stripe post LAYOUTRETURN may result in data
>>> corruption due to parity inconsistency.
>> 
>> For objects.
>> 
>> -->Andy
>> 
>>> 
>>> Benny
>>> 
>>>>> But you are doing that by assuming the
>>>>> Server will fence ALL IO,
>>>> 
>>>> What? No!
>>>> 
>>>>> and not by simply aborting your own Q.
>>>> 
>>>> See above. Of course we abort/redirect our Q.
>>>> 
>>>> We choose not to lose data. We do abort any RPC/NFS/Session queues and
>>>> re-direct. Only the in-flight RPC's which we have no idea of their
>>>> success are resent _after_ getting the error.  The LAYOUTRETURN is an
>>>> indication to the MDS that all is not well.
>>>> 
>>>>> Highly unorthodox
>>>> 
>>>> I'm open to suggestions. :) As I pointed out above, the only reason to
>>>> send the LAYOUTRETURN is to let the MDS know that some I/O might be
>>>> resent. Once the server gets the returned layout, it MUST reject any
>>>> I/O using that layout. (section 13.6).
>>>> 
>>>>> and certainly in violation of above.
>>>> 
>>>> I disagree.
>>>> 
>>>> -->Andy
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> I was under the impression that only when the last reference
>>>>>>> on a layout is dropped only then we send the lo_return.
>>>>>> 
>>>>>>> If it is not so, this is the proper fix.
>>>>>> 
>>>>>>> 
>>>>>>> 1. Mark LO invalid so all new IO waits or goes to MDS
>>>>>>> 3. When LO ref drops to Zero send the lo_return.
>>>>>> 
>>>>>> Yes for the normal case (evict inode for the file layout),
>>>>> 
>>>>>> but not if I'm in an error situation and I want to fence the DS from in-flight i/o.
>>>>> 
>>>>> A client has no way stated in the protocol to cause a "fence". This is just your
>>>>> wishful thinking. lo_return is something else, lo_return means I'm no longer using
>>>>> the file, not "please protect me from myself, because I will send more IO after that,
>>>>> but please ignore it"
>>>>> 
>>>>> All you can do is abort your own client Q, all these RPCs that did not get sent
>>>>> or errored-out will be resent through MDS, there might be half an RPC of overlap.
>>>>> 
>>>>> Think of 4.2 when you will need to report these errors to MDS - on lo_return -
>>>>> Your code will not work.
>>>>> 
>>>>> Lets backtrack a second. Let me see if I understand what is your trick:
>>>>> 1. Say we have a file with a files-layout of which device_id specifies 3 DSs.
>>>>> 2. And say I have a large IO generated by an app to that file.
>>>>> 3. In the middle of the IO, one of the DSs, say DS-B, returns with a "connection error"
>>>>> 4 The RPCs stripe_units generated to DS-B will all quickly return with "connection error"
>>>>> after the first error.
>>>>> 5. But the RPCs to DS-A and DS-C will continue to IO. Until done.
>>>>> 
>>>>> But since you will send the entire range of IO through MDS you want that DS-A, DS-C to
>>>>> reject any farther IO, and any RPCs in client Qs for DS-A and DS-C will return quickly
>>>>> with "io rejected".
>>>>> 
>>>>> Is that your idea?
>>>>> 
>>>>>> 
>>>>>>> 4. After LAYOUTRETURN_DONE is back, re-enable layouts.
>>>>>>> I'm so sorry it is not so today. I should have tested for
>>>>>>> this. I admit that all my error injection tests are
>>>>>>> single-file single thread so I did not test for this.
>>>>>>> 
>>>>>>> Sigh, work never ends. Tell me if I can help with this
>>>>>> 
>>>>>> I'll add the wait/no-wait…
>>>>>> 
>>>>> 
>>>>> 
>>>>> I would not want a sync wait at all. Please don't code any sync wait for me. It will not
>>>>> work for objects, and will produce dead-locks.
>>>>> 
>>>>> What I need is that a flag is raised on the lo_seg, and when last ref on the
>>>>> lo_seg drops an lo_return is sent. So either the call to layout_return causes
>>>>> the lo_return, or the last io_done of the inflights will cause the send.
>>>>> (You see io_done is the one that calls layout_return in the first place)
>>>>> 
>>>>> !! BUT Please do not do any waits at all for my sake, because this is a sure dead-lock !!
>>>>> 
>>>>> And if you ask me It's the way you want it too, Because that's the RFC, and that's
>>>>> what you'll need for 4.2.
>>>>> 
>>>>> And perhaps it should not be that hard to implement a Q abort, and not need that
>>>>> unorthodox fencing which is not specified in the RFC. And it might be also possible to only
>>>>> send IO of DS-B through MDS but keep the inflight IO to DS-A and DS-C valid and not resend.
>>>>> 
>>>>>> -->Andy
>>>>>> 
>>>>>>> Boaz
>>>>> 
>>>>> 
>>>>> Thanks
>>>>> Boaz
>>>>> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] NFSv4.1 mark layout when already returned
  2012-06-11 15:08             ` Adamson, Andy
  2012-06-11 15:38               ` Benny Halevy
@ 2012-06-11 15:59               ` Boaz Harrosh
  1 sibling, 0 replies; 26+ messages in thread
From: Boaz Harrosh @ 2012-06-11 15:59 UTC (permalink / raw)
  To: Adamson, Andy
  Cc: Benny Halevy, Andy Adamson, Myklebust, Trond,
	<linux-nfs@vger.kernel.org>

On 06/11/2012 06:08 PM, Adamson, Andy wrote:

>> You need to wait for in-flight I/Os to either succeed, fail (e.g. time out),
>> or be aborted.  The non-successful cases are going to be reported by the
>> objects layout driver so the MDS can recover from these errors.
> 

> The object layout driver may have this requirement, but the file
> layout driver does not.


I do not see differing text in that matter. All I see is that
for objects all the core pNFS wording applies, plus the error reporting
as part of LAYOUTRETURN.

Specifically, I do not see any exemption for the files layout about using
layouts (segments) after a LAYOUTRETURN was sent.

And I do not agree with Benny in regard to fencing off. Objects has a very
robust, specifically specified mechanism for "fencing off", just as files
has.

And I do not see anywhere where the client is permitted to send new RPCs
using the same (old) layout_segment after it was returned - which is what
your patch does, because there is a race between pnfs_layout_return()
and the pagelist read/write paths.

What is so wrong with waiting for the layout_segment reference to drop
to zero? You still did not explain. What are you trying to solve
by sending a LAYOUTRETURN before the layout_segment is fully released?
Please explain.

Thanks
Boaz

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] NFSv4.1 mark layout when already returned
  2012-06-11 15:52                 ` Adamson, Andy
@ 2012-06-11 16:07                   ` Boaz Harrosh
  0 siblings, 0 replies; 26+ messages in thread
From: Boaz Harrosh @ 2012-06-11 16:07 UTC (permalink / raw)
  To: Adamson, Andy
  Cc: Benny Halevy, Andy Adamson, Myklebust, Trond,
	<linux-nfs@vger.kernel.org>

On 06/11/2012 06:52 PM, Adamson, Andy wrote:
<>

>>
>> Yeah, right now pnfs_return_layout always returns the whole layout
>> so this patch works.  Boaz's point (I think) was that marking the whole layout
>> as returned in the generic layer will prevent returning of layout segments
>> by the objects layout.
> 
> Yep. I agree that _pnfs_return_layout needs to operate on an input range, and 
> should wait for all lseg references - the "normal" behavior. 
> 


You mean the normal "error handling" behavior, right?

> Then, after that is done,
> I will write a patch to allow the file layout driver to call _pnfs_return_layout in
> "fence mode", i.e. returning the whole layout and not waiting for the last lseg reference.
> 


This one I do not understand, and I am really trying. Please be patient with me:
1. What does "whole layout" mean? Is it just shorthand for
   "all segments"?
2. What is a "fence mode" LAYOUTRETURN? Where did you take this one from?
3. How does "fence mode" permit a client to use a layout segment after
   LAYOUTRETURN is sent? Since that is what we do (to date) because
   of the race.

Thanks
Boaz

> -->Andy
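
Purely as an illustration of the two modes being discussed (these names do not
exist in the kernel), the range-plus-mode interface might be expressed as a
prototype along these lines:

enum pnfs_return_mode {
	PNFS_RETURN_WAIT,	/* wait for in-flight I/O to drop its lseg references */
	PNFS_RETURN_FENCE,	/* files layout: return the whole layout, do not wait */
};

int _pnfs_return_layout_range(struct inode *ino,
			      struct pnfs_layout_range *range,
			      enum pnfs_return_mode mode);

Whether the non-waiting mode can be reconciled with the spec is exactly the
open question in this part of the thread.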



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] NFSv4.1 mark layout when already returned
  2012-06-11 15:41                     ` Adamson, Andy
@ 2012-06-11 16:14                       ` Boaz Harrosh
  0 siblings, 0 replies; 26+ messages in thread
From: Boaz Harrosh @ 2012-06-11 16:14 UTC (permalink / raw)
  To: Adamson, Andy
  Cc: Benny Halevy, Welch, Brent, Bhamare, Sachin, Andy Adamson,
	Myklebust, Trond, <linux-nfs@vger.kernel.org>,
	Huang, James

On 06/11/2012 06:41 PM, Adamson, Andy wrote:

> On Jun 11, 2012, at 10:51 AM, Boaz Harrosh wrote:
>> I'm not sure I understood what you meant.
>>
>> In the "normal behavior" we don't send a LAYOUTRETURN at all
>> in the forgetful model, which proved beneficial in the light
>> of concurrent GETs vs. RETURNs.
>>
>> I thought we were only talking about the error case. Do you
>> mean there is a 3rd case? Please explain.
> 
> The normal case of calling LAYOUTRETURN on evict inode. 
> 


The evict-inode LAYOUTRETURN is moot, because evict() means the inode has no
more references, which specifically means there are no more layout_segments
holding a reference on the inode.

The interesting part is the error-handling case on one of the
layout_segments. What are we allowed to do after we send the
LAYOUTRETURN to the MDS?

That is what I'm talking about, and it is part of the general
error reporting to come in 4.2.

Boaz

> -->Andy
> 
>>
>> Thanks
>> Boaz
> 



^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2012-06-11 16:14 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-01 17:19 [PATCH 1/3] NFSv4.1 do not call LAYOUTRETURN when there are no legs andros
2012-06-01 17:19 ` [PATCH 2/3] NFSv4.1 mark layout when already returned andros
2012-06-02 22:51   ` Boaz Harrosh
2012-06-05 13:36     ` Adamson, Andy
2012-06-05 13:47       ` Andy Adamson
2012-06-05 14:54       ` Boaz Harrosh
2012-06-05 19:22         ` Andy Adamson
2012-06-05 20:49           ` Boaz Harrosh
2012-06-11  9:56           ` Benny Halevy
2012-06-11 10:44             ` Boaz Harrosh
2012-06-11 14:04               ` Benny Halevy
2012-06-11 14:21                 ` Adamson, Andy
2012-06-11 14:51                   ` Boaz Harrosh
2012-06-11 15:41                     ` Adamson, Andy
2012-06-11 16:14                       ` Boaz Harrosh
2012-06-11 15:08             ` Adamson, Andy
2012-06-11 15:38               ` Benny Halevy
2012-06-11 15:52                 ` Adamson, Andy
2012-06-11 16:07                   ` Boaz Harrosh
2012-06-11 15:59               ` Boaz Harrosh
2012-06-01 17:19 ` [PATCH 3/3] NFSv4.1 fence all layouts with file layout data server connection errors andros
2012-06-02 22:33   ` Boaz Harrosh
2012-06-04 13:49     ` Adamson, Andy
2012-06-02 23:00 ` [PATCH 1/3] NFSv4.1 do not call LAYOUTRETURN when there are no legs Boaz Harrosh
2012-06-05 13:36   ` Adamson, Andy
2012-06-05 15:01     ` Boaz Harrosh
