All of lore.kernel.org
 help / color / mirror / Atom feed
* Linux NFSv4 client uses returned delegation in subsequent READ resulting in hang (BAD_STATEID)
@ 2012-07-01  1:53 Charles 'Boyo
  2012-07-02 14:09 ` Chuck Lever
  0 siblings, 1 reply; 13+ messages in thread
From: Charles 'Boyo @ 2012-07-01  1:53 UTC (permalink / raw)
  To: linux-nfs

Hello.

I have repeatedly had Linux NFS clients hang while trying to access
files on a NFSv4 mount (Solaris).
Investigations revealed that this client is using a delegation that it
has already returned, resulting in the BAD_STATEID error.
Unfortunately, it then proceeds to hammer the server with these
"doomed" requests, resulting in the client-side unresponsiveness and
constant network traffic.

A sample trace can be found at http://pastebin.centos.org/39046
As shown, the READ in frame 10 (line 112) follows the DELEGRETURN in
frame 9 which results in the error. This READ was then repeated
infinitely until either the server or client was restarted.
Disabling delegations on the server-side caused the problem to cease.
So what is wrong with delegations on the client-side?

I am using the latest nfs-utils packages and my mount options are as
shown below:

# cat /etc/redhat-release
CentOS release 6.2 (Final)

# uname -r
2.6.32-220.4.1.el6.x86_64

# rpm -qa '*nfs*'
nfs-utils-lib-1.1.5-4.el6.x86_64
nfs-utils-1.2.3-15.el6_2.1.x86_64

# grep nfs4 /proc/mounts
10.51.1.6:/SharedFolder/ /var/LocalMountPoint nfs4
rw,relatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.51.1.34,minorversion=0,local_lock=none,addr=10.51.1.6
0 0

Regards,

Charles

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Linux NFSv4 client uses returned delegation in subsequent READ resulting in hang (BAD_STATEID)
  2012-07-01  1:53 Linux NFSv4 client uses returned delegation in subsequent READ resulting in hang (BAD_STATEID) Charles 'Boyo
@ 2012-07-02 14:09 ` Chuck Lever
  2012-07-02 20:22   ` Charles 'Boyo
  0 siblings, 1 reply; 13+ messages in thread
From: Chuck Lever @ 2012-07-02 14:09 UTC (permalink / raw)
  To: Charles 'Boyo; +Cc: linux-nfs


On Jun 30, 2012, at 9:53 PM, Charles 'Boyo wrote:

> Hello.
> 
> I have repeatedly had Linux NFS clients hang while trying to access
> files on a NFSv4 mount (Solaris).
> Investigations revealed that this client is using a delegation that it
> has already returned, resulting in the BAD_STATEID error.
> Unfortunately, it then proceeds to hammer the server with these
> "doomed" requests, resulting in the client-side unresponsiveness and
> constant network traffic.
> 
> A sample trace can be found at http://pastebin.centos.org/39046
> As shown, the READ in frame 10 (line 112) follows the DELEGRETURN in
> frame 9 which results in the error. This READ was then repeated
> infinitely until either the server or client was restarted.
> Disabling delegations on the server-side caused the problem to cease.
> So what is wrong with delegations on the client-side?

Usually we see this behavior because of a race between an OPEN with delegation and a delegation recall.  In this case, however, the client is actively returning a READ delegation, then proceeding to use it anyway.  I don't see the server's recall callback, though, and there are other indications that this trace is not complete.  So it's hard to be 100% confident.

As far as I know, the EL6.2 client does not have support for recovering a single bad STATEID, which is why it is looping.  That support is available in mainline kernels 3.0 and later.

However, it seems to me that it is a bug for the client to continue using a delegation that it has returned.

You have already found one work-around: disable delegations on the NFS server.  Or you could mount with NFSv3.  Or, if feasible, your application could be modified to use fcntl() locking.


> I am using the latest nfs-utils packages and my mount options are as
> shown below:
> 
> # cat /etc/redhat-release
> CentOS release 6.2 (Final)
> 
> # uname -r
> 2.6.32-220.4.1.el6.x86_64
> 
> # rpm -qa '*nfs*'
> nfs-utils-lib-1.1.5-4.el6.x86_64
> nfs-utils-1.2.3-15.el6_2.1.x86_64
> 
> # grep nfs4 /proc/mounts
> 10.51.1.6:/SharedFolder/ /var/LocalMountPoint nfs4
> rw,relatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.51.1.34,minorversion=0,local_lock=none,addr=10.51.1.6
> 0 0
> 
> Regards,
> 
> Charles
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Linux NFSv4 client uses returned delegation in subsequent READ resulting in hang (BAD_STATEID)
  2012-07-02 14:09 ` Chuck Lever
@ 2012-07-02 20:22   ` Charles 'Boyo
  2012-07-02 20:35     ` Chuck Lever
  0 siblings, 1 reply; 13+ messages in thread
From: Charles 'Boyo @ 2012-07-02 20:22 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs

On Mon, Jul 2, 2012 at 3:09 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
> Usually we see this behavior because of a race between an OPEN with delegation and a delegation recall.  In this case, however, the client is actively returning a READ
> delegation, then proceeding to use it anyway.  I don't see the server's recall callback, though, and there are other indications that this trace is not complete. So it's hard
>  to be 100% confident.
>
The trace is not complete, it includes just enough information to
explain the problem.
However I can confirm the service did not send a recall callback, the
client returned the delegation of its own "free will".
>
> As far as I know, the EL6.2 client does not have support for recovering a single bad STATEID, which is why it is looping.  That support is available in mainline kernels 3.0
> and later.
>
> However, it seems to me that it is a bug for the client to continue using a delegation that it has returned.
>
Is it possible is a scheduling issue of some sort, where the READ
should have been sent ahead of the DELEGRETURN but somehow got mixed
up?
>
> You have already found one work-around: disable delegations on the NFS server.  Or you could mount with NFSv3.  Or, if feasible, your application could be modified to
> use fcntl() locking.
>
In my case, disabling delegation is the only feasible work-around.
NFSv3 creates new issues with identity mapping and the application is
closed-source.
With delegation disabled, what else do I stand to lose apart from some
client-side efficiencies? I have noticed that the client has resorted
to closing and re-opening commonly used files every few seconds -
probably an attempt to flush all data out to the server as soon as
possible. This hasn't caused me any grief, but I don't know what I'm
missing.

Regards,

Charles

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Linux NFSv4 client uses returned delegation in subsequent READ resulting in hang (BAD_STATEID)
  2012-07-02 20:22   ` Charles 'Boyo
@ 2012-07-02 20:35     ` Chuck Lever
  2012-07-02 21:13       ` Myklebust, Trond
  0 siblings, 1 reply; 13+ messages in thread
From: Chuck Lever @ 2012-07-02 20:35 UTC (permalink / raw)
  To: Charles 'Boyo; +Cc: linux-nfs


On Jul 2, 2012, at 4:22 PM, Charles 'Boyo wrote:

> On Mon, Jul 2, 2012 at 3:09 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>> Usually we see this behavior because of a race between an OPEN with delegation and a delegation recall.  In this case, however, the client is actively returning a READ
>> delegation, then proceeding to use it anyway.  I don't see the server's recall callback, though, and there are other indications that this trace is not complete. So it's hard
>> to be 100% confident.
>> 
> The trace is not complete, it includes just enough information to
> explain the problem.
> However I can confirm the service did not send a recall callback, the
> client returned the delegation of its own "free will".

The callback would come on a separate TCP connection.  I can't think of a reason that a client would return a delegation by itself and then subsequently start to use it.

>> 
>> As far as I know, the EL6.2 client does not have support for recovering a single bad STATEID, which is why it is looping.  That support is available in mainline kernels 3.0
>> and later.
>> 
>> However, it seems to me that it is a bug for the client to continue using a delegation that it has returned.
>> 
> Is it possible is a scheduling issue of some sort, where the READ
> should have been sent ahead of the DELEGRETURN but somehow got mixed
> up?

Or possibly that the DELEGRETURN doesn't actually remove the delegation state ID until the server has replied, and the READ request was sent before the DELEGRETURN reply arrived at the client.

>> 
>> You have already found one work-around: disable delegations on the NFS server.  Or you could mount with NFSv3.  Or, if feasible, your application could be modified to
>> use fcntl() locking.
>> 
> In my case, disabling delegation is the only feasible work-around.
> NFSv3 creates new issues with identity mapping and the application is
> closed-source.
> With delegation disabled, what else do I stand to lose apart from some
> client-side efficiencies? I have noticed that the client has resorted
> to closing and re-opening commonly used files every few seconds -
> probably an attempt to flush all data out to the server as soon as
> possible.

Delegation allows the client to leave a file open and cache data more aggressively.  The extra CLOSE operations are likely due to close-to-open requirements (NFS optimizes for serial file sharing).

> This hasn't caused me any grief, but I don't know what I'm
> missing.

If you haven't noticed any troubling behavior, then there is probably not going to be a major impact for your workload.

> 
> Regards,
> 
> Charles
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Linux NFSv4 client uses returned delegation in subsequent READ resulting in hang (BAD_STATEID)
  2012-07-02 20:35     ` Chuck Lever
@ 2012-07-02 21:13       ` Myklebust, Trond
  2012-07-02 21:19         ` Chuck Lever
  0 siblings, 1 reply; 13+ messages in thread
From: Myklebust, Trond @ 2012-07-02 21:13 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Charles 'Boyo, linux-nfs

T24gTW9uLCAyMDEyLTA3LTAyIGF0IDE2OjM1IC0wNDAwLCBDaHVjayBMZXZlciB3cm90ZToNCj4g
T24gSnVsIDIsIDIwMTIsIGF0IDQ6MjIgUE0sIENoYXJsZXMgJ0JveW8gd3JvdGU6DQo+IA0KPiA+
IE9uIE1vbiwgSnVsIDIsIDIwMTIgYXQgMzowOSBQTSwgQ2h1Y2sgTGV2ZXIgPGNodWNrLmxldmVy
QG9yYWNsZS5jb20+IHdyb3RlOg0KPiA+PiANCj4gPj4gVXN1YWxseSB3ZSBzZWUgdGhpcyBiZWhh
dmlvciBiZWNhdXNlIG9mIGEgcmFjZSBiZXR3ZWVuIGFuIE9QRU4gd2l0aCBkZWxlZ2F0aW9uIGFu
ZCBhIGRlbGVnYXRpb24gcmVjYWxsLiAgSW4gdGhpcyBjYXNlLCBob3dldmVyLCB0aGUgY2xpZW50
IGlzIGFjdGl2ZWx5IHJldHVybmluZyBhIFJFQUQNCj4gPj4gZGVsZWdhdGlvbiwgdGhlbiBwcm9j
ZWVkaW5nIHRvIHVzZSBpdCBhbnl3YXkuICBJIGRvbid0IHNlZSB0aGUgc2VydmVyJ3MgcmVjYWxs
IGNhbGxiYWNrLCB0aG91Z2gsIGFuZCB0aGVyZSBhcmUgb3RoZXIgaW5kaWNhdGlvbnMgdGhhdCB0
aGlzIHRyYWNlIGlzIG5vdCBjb21wbGV0ZS4gU28gaXQncyBoYXJkDQo+ID4+IHRvIGJlIDEwMCUg
Y29uZmlkZW50Lg0KPiA+PiANCj4gPiBUaGUgdHJhY2UgaXMgbm90IGNvbXBsZXRlLCBpdCBpbmNs
dWRlcyBqdXN0IGVub3VnaCBpbmZvcm1hdGlvbiB0bw0KPiA+IGV4cGxhaW4gdGhlIHByb2JsZW0u
DQo+ID4gSG93ZXZlciBJIGNhbiBjb25maXJtIHRoZSBzZXJ2aWNlIGRpZCBub3Qgc2VuZCBhIHJl
Y2FsbCBjYWxsYmFjaywgdGhlDQo+ID4gY2xpZW50IHJldHVybmVkIHRoZSBkZWxlZ2F0aW9uIG9m
IGl0cyBvd24gImZyZWUgd2lsbCIuDQo+IA0KPiBUaGUgY2FsbGJhY2sgd291bGQgY29tZSBvbiBh
IHNlcGFyYXRlIFRDUCBjb25uZWN0aW9uLiAgSSBjYW4ndCB0aGluayBvZiBhIHJlYXNvbiB0aGF0
IGEgY2xpZW50IHdvdWxkIHJldHVybiBhIGRlbGVnYXRpb24gYnkgaXRzZWxmIGFuZCB0aGVuIHN1
YnNlcXVlbnRseSBzdGFydCB0byB1c2UgaXQuDQoNCkkgY2FuOiB0aGVyZSBhcmUgYSBudW1iZXIg
b2Ygc2VydmVycyBvdXQgdGhlcmUgdGhhdCB2aW9sYXRlIHRoZSBzcGVjIGJ5DQpyZXR1cm5pbmcg
YSBkZWxlZ2F0aW9uIGFzIHBhcnQgb2YgYW4gT1BFTihDTEFJTV9ERUxFR0FURV9DVVIpLiBVc3Vh
bGx5DQp0aG9zZSBicm9rZW4gc2VydmVycyB3aWxsIHNlbmQgdGhlIGV4YWN0IHNhbWUgc3RhdGVp
ZCBhcyB0aGUgZGVsZWdhdGlvbg0KdGhhdCBpcyBiZWluZyByZXR1cm5lZC4NCg0KVGhlIExpbnV4
IGNsaWVudCBkb2VzIG5vdCBleHBlY3QgYSBkZWxlZ2F0aW9uIHdoZW4gaXQgc2VuZHMNCkNMQUlN
X0RFTEVHQVRFX0NVUiByZXF1ZXN0cywgYW5kIGVuZHMgdXAgZ2V0dGluZyBjb25mdXNlZDsgaXQg
cmVpbnN0YXRlcw0KdGhlIGRlbGVnYXRpb24uIFRoaXMgaXMgYW4gaXNzdWUgdGhhdCB3ZSB3aWxs
IG5vdCBmaXggb24gdGhlIGNsaWVudCwNCnNpbmNlIGl0IGlzIGEgc2VydmVyIGJ1Zy4uLg0KDQpD
aGVlcnMNCiAgVHJvbmQNCi0tIA0KVHJvbmQgTXlrbGVidXN0DQpMaW51eCBORlMgY2xpZW50IG1h
aW50YWluZXINCg0KTmV0QXBwDQpUcm9uZC5NeWtsZWJ1c3RAbmV0YXBwLmNvbQ0Kd3d3Lm5ldGFw
cC5jb20NCg0K

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Linux NFSv4 client uses returned delegation in subsequent READ resulting in hang (BAD_STATEID)
  2012-07-02 21:13       ` Myklebust, Trond
@ 2012-07-02 21:19         ` Chuck Lever
  2012-07-02 21:48           ` Myklebust, Trond
  2012-07-02 21:50           ` Charles 'Boyo
  0 siblings, 2 replies; 13+ messages in thread
From: Chuck Lever @ 2012-07-02 21:19 UTC (permalink / raw)
  To: Myklebust, Trond; +Cc: Charles 'Boyo, linux-nfs


On Jul 2, 2012, at 5:13 PM, Myklebust, Trond wrote:

> On Mon, 2012-07-02 at 16:35 -0400, Chuck Lever wrote:
>> On Jul 2, 2012, at 4:22 PM, Charles 'Boyo wrote:
>> 
>>> On Mon, Jul 2, 2012 at 3:09 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>> 
>>>> Usually we see this behavior because of a race between an OPEN with delegation and a delegation recall.  In this case, however, the client is actively returning a READ
>>>> delegation, then proceeding to use it anyway.  I don't see the server's recall callback, though, and there are other indications that this trace is not complete. So it's hard
>>>> to be 100% confident.
>>>> 
>>> The trace is not complete, it includes just enough information to
>>> explain the problem.
>>> However I can confirm the service did not send a recall callback, the
>>> client returned the delegation of its own "free will".
>> 
>> The callback would come on a separate TCP connection.  I can't think of a reason that a client would return a delegation by itself and then subsequently start to use it.
> 
> I can: there are a number of servers out there that violate the spec by
> returning a delegation as part of an OPEN(CLAIM_DELEGATE_CUR). Usually
> those broken servers will send the exact same stateid as the delegation
> that is being returned.

The OPEN in frame 7 is a CLAIM_NULL OPEN, isn't it?

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Linux NFSv4 client uses returned delegation in subsequent READ resulting in hang (BAD_STATEID)
  2012-07-02 21:19         ` Chuck Lever
@ 2012-07-02 21:48           ` Myklebust, Trond
  2012-07-02 21:50             ` Chuck Lever
  2012-07-02 21:50           ` Charles 'Boyo
  1 sibling, 1 reply; 13+ messages in thread
From: Myklebust, Trond @ 2012-07-02 21:48 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Charles 'Boyo, linux-nfs

T24gTW9uLCAyMDEyLTA3LTAyIGF0IDE3OjE5IC0wNDAwLCBDaHVjayBMZXZlciB3cm90ZToNCj4g
T24gSnVsIDIsIDIwMTIsIGF0IDU6MTMgUE0sIE15a2xlYnVzdCwgVHJvbmQgd3JvdGU6DQo+IA0K
PiA+IE9uIE1vbiwgMjAxMi0wNy0wMiBhdCAxNjozNSAtMDQwMCwgQ2h1Y2sgTGV2ZXIgd3JvdGU6
DQo+ID4+IE9uIEp1bCAyLCAyMDEyLCBhdCA0OjIyIFBNLCBDaGFybGVzICdCb3lvIHdyb3RlOg0K
PiA+PiANCj4gPj4+IE9uIE1vbiwgSnVsIDIsIDIwMTIgYXQgMzowOSBQTSwgQ2h1Y2sgTGV2ZXIg
PGNodWNrLmxldmVyQG9yYWNsZS5jb20+IHdyb3RlOg0KPiA+Pj4+IA0KPiA+Pj4+IFVzdWFsbHkg
d2Ugc2VlIHRoaXMgYmVoYXZpb3IgYmVjYXVzZSBvZiBhIHJhY2UgYmV0d2VlbiBhbiBPUEVOIHdp
dGggZGVsZWdhdGlvbiBhbmQgYSBkZWxlZ2F0aW9uIHJlY2FsbC4gIEluIHRoaXMgY2FzZSwgaG93
ZXZlciwgdGhlIGNsaWVudCBpcyBhY3RpdmVseSByZXR1cm5pbmcgYSBSRUFEDQo+ID4+Pj4gZGVs
ZWdhdGlvbiwgdGhlbiBwcm9jZWVkaW5nIHRvIHVzZSBpdCBhbnl3YXkuICBJIGRvbid0IHNlZSB0
aGUgc2VydmVyJ3MgcmVjYWxsIGNhbGxiYWNrLCB0aG91Z2gsIGFuZCB0aGVyZSBhcmUgb3RoZXIg
aW5kaWNhdGlvbnMgdGhhdCB0aGlzIHRyYWNlIGlzIG5vdCBjb21wbGV0ZS4gU28gaXQncyBoYXJk
DQo+ID4+Pj4gdG8gYmUgMTAwJSBjb25maWRlbnQuDQo+ID4+Pj4gDQo+ID4+PiBUaGUgdHJhY2Ug
aXMgbm90IGNvbXBsZXRlLCBpdCBpbmNsdWRlcyBqdXN0IGVub3VnaCBpbmZvcm1hdGlvbiB0bw0K
PiA+Pj4gZXhwbGFpbiB0aGUgcHJvYmxlbS4NCj4gPj4+IEhvd2V2ZXIgSSBjYW4gY29uZmlybSB0
aGUgc2VydmljZSBkaWQgbm90IHNlbmQgYSByZWNhbGwgY2FsbGJhY2ssIHRoZQ0KPiA+Pj4gY2xp
ZW50IHJldHVybmVkIHRoZSBkZWxlZ2F0aW9uIG9mIGl0cyBvd24gImZyZWUgd2lsbCIuDQo+ID4+
IA0KPiA+PiBUaGUgY2FsbGJhY2sgd291bGQgY29tZSBvbiBhIHNlcGFyYXRlIFRDUCBjb25uZWN0
aW9uLiAgSSBjYW4ndCB0aGluayBvZiBhIHJlYXNvbiB0aGF0IGEgY2xpZW50IHdvdWxkIHJldHVy
biBhIGRlbGVnYXRpb24gYnkgaXRzZWxmIGFuZCB0aGVuIHN1YnNlcXVlbnRseSBzdGFydCB0byB1
c2UgaXQuDQo+ID4gDQo+ID4gSSBjYW46IHRoZXJlIGFyZSBhIG51bWJlciBvZiBzZXJ2ZXJzIG91
dCB0aGVyZSB0aGF0IHZpb2xhdGUgdGhlIHNwZWMgYnkNCj4gPiByZXR1cm5pbmcgYSBkZWxlZ2F0
aW9uIGFzIHBhcnQgb2YgYW4gT1BFTihDTEFJTV9ERUxFR0FURV9DVVIpLiBVc3VhbGx5DQo+ID4g
dGhvc2UgYnJva2VuIHNlcnZlcnMgd2lsbCBzZW5kIHRoZSBleGFjdCBzYW1lIHN0YXRlaWQgYXMg
dGhlIGRlbGVnYXRpb24NCj4gPiB0aGF0IGlzIGJlaW5nIHJldHVybmVkLg0KPiANCj4gVGhlIE9Q
RU4gaW4gZnJhbWUgNyBpcyBhIENMQUlNX05VTEwgT1BFTiwgaXNuJ3QgaXQ/DQoNClllcy4gSXMg
dGhlcmUgYSBwcmVhbWJsZSB0byB0aGF0LCBvciBkaWQgdGhlIGNsaWVudCByZWFsbHkgdHJ5IHRv
DQpkZWxlZ3JldHVybiBpbW1lZGlhdGVseSB1cG9uIHJlY2VpdmluZyBhIGRlbGVnYXRpb24/DQoN
Ci0tIA0KVHJvbmQgTXlrbGVidXN0DQpMaW51eCBORlMgY2xpZW50IG1haW50YWluZXINCg0KTmV0
QXBwDQpUcm9uZC5NeWtsZWJ1c3RAbmV0YXBwLmNvbQ0Kd3d3Lm5ldGFwcC5jb20NCg0K

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Linux NFSv4 client uses returned delegation in subsequent READ resulting in hang (BAD_STATEID)
  2012-07-02 21:48           ` Myklebust, Trond
@ 2012-07-02 21:50             ` Chuck Lever
  0 siblings, 0 replies; 13+ messages in thread
From: Chuck Lever @ 2012-07-02 21:50 UTC (permalink / raw)
  To: Charles 'Boyo; +Cc: Linux;NFS Mailing List, Trond Myklebust


On Jul 2, 2012, at 5:48 PM, Myklebust, Trond wrote:

> On Mon, 2012-07-02 at 17:19 -0400, Chuck Lever wrote:
>> On Jul 2, 2012, at 5:13 PM, Myklebust, Trond wrote:
>> 
>>> On Mon, 2012-07-02 at 16:35 -0400, Chuck Lever wrote:
>>>> On Jul 2, 2012, at 4:22 PM, Charles 'Boyo wrote:
>>>> 
>>>>> On Mon, Jul 2, 2012 at 3:09 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>>> 
>>>>>> Usually we see this behavior because of a race between an OPEN with delegation and a delegation recall.  In this case, however, the client is actively returning a READ
>>>>>> delegation, then proceeding to use it anyway.  I don't see the server's recall callback, though, and there are other indications that this trace is not complete. So it's hard
>>>>>> to be 100% confident.
>>>>>> 
>>>>> The trace is not complete, it includes just enough information to
>>>>> explain the problem.
>>>>> However I can confirm the service did not send a recall callback, the
>>>>> client returned the delegation of its own "free will".
>>>> 
>>>> The callback would come on a separate TCP connection.  I can't think of a reason that a client would return a delegation by itself and then subsequently start to use it.
>>> 
>>> I can: there are a number of servers out there that violate the spec by
>>> returning a delegation as part of an OPEN(CLAIM_DELEGATE_CUR). Usually
>>> those broken servers will send the exact same stateid as the delegation
>>> that is being returned.
>> 
>> The OPEN in frame 7 is a CLAIM_NULL OPEN, isn't it?
> 
> Yes. Is there a preamble to that, or did the client really try to
> delegreturn immediately upon receiving a delegation?

Charles, we would like to see an un-redacted trace, if you have one.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Linux NFSv4 client uses returned delegation in subsequent READ resulting in hang (BAD_STATEID)
  2012-07-02 21:19         ` Chuck Lever
  2012-07-02 21:48           ` Myklebust, Trond
@ 2012-07-02 21:50           ` Charles 'Boyo
  2012-07-02 21:59             ` Myklebust, Trond
  1 sibling, 1 reply; 13+ messages in thread
From: Charles 'Boyo @ 2012-07-02 21:50 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Myklebust, Trond, linux-nfs

On Mon, Jul 2, 2012 at 10:19 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
> On Jul 2, 2012, at 5:13 PM, Myklebust, Trond wrote:
>
>> On Mon, 2012-07-02 at 16:35 -0400, Chuck Lever wrote:
>>> On Jul 2, 2012, at 4:22 PM, Charles 'Boyo wrote:
>>>
>>>> On Mon, Jul 2, 2012 at 3:09 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>>
>>>>> Usually we see this behavior because of a race between an OPEN with delegation and a delegation recall.  In this case, however, the client is actively returning a READ
>>>>> delegation, then proceeding to use it anyway.  I don't see the server's recall callback, though, and there are other indications that this trace is not complete. So it's hard
>>>>> to be 100% confident.
>>>>>
>>>> The trace is not complete, it includes just enough information to
>>>> explain the problem.
>>>> However I can confirm the service did not send a recall callback, the
>>>> client returned the delegation of its own "free will".
>>>
>>> The callback would come on a separate TCP connection.  I can't think of a reason that a client would return a delegation by itself and then subsequently start to use it.
>>
>> I can: there are a number of servers out there that violate the spec by
>> returning a delegation as part of an OPEN(CLAIM_DELEGATE_CUR). Usually
>> those broken servers will send the exact same stateid as the delegation
>> that is being returned.
>
> The OPEN in frame 7 is a CLAIM_NULL OPEN, isn't it?
>
The OPEN in this case is a CLAIM_NULL and I have re-examined my
network dump, there was no call back from the server.
So why would the client returns a delegation voluntarily and then re-use it?

>>>> Is it possible is a scheduling issue of some sort, where the READ
>>>> should have been sent ahead of the DELEGRETURN but somehow got mixed
>>>> up?

>>> Or possibly that the DELEGRETURN doesn't actually remove the delegation state ID until the server has replied, and the READ request was sent before the DELEGRETURN
>>> reply arrived at the client.

Indeed, the READ was issued after the DELEGRETURN but before the
response to it. Is it possible to check if this is expected behavior?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Linux NFSv4 client uses returned delegation in subsequent READ resulting in hang (BAD_STATEID)
  2012-07-02 21:50           ` Charles 'Boyo
@ 2012-07-02 21:59             ` Myklebust, Trond
       [not found]               ` <CAPD_G3KS6unxC4qJ1YY1GM-PCbG5sXLC6-aYk6L1dq4Jd0U7MQ@mail.gmail.com>
  0 siblings, 1 reply; 13+ messages in thread
From: Myklebust, Trond @ 2012-07-02 21:59 UTC (permalink / raw)
  To: Charles 'Boyo; +Cc: Chuck Lever, linux-nfs

T24gTW9uLCAyMDEyLTA3LTAyIGF0IDIyOjUwICswMTAwLCBDaGFybGVzICdCb3lvIHdyb3RlOg0K
PiBPbiBNb24sIEp1bCAyLCAyMDEyIGF0IDEwOjE5IFBNLCBDaHVjayBMZXZlciA8Y2h1Y2subGV2
ZXJAb3JhY2xlLmNvbT4gd3JvdGU6DQo+ID4NCj4gPiBPbiBKdWwgMiwgMjAxMiwgYXQgNToxMyBQ
TSwgTXlrbGVidXN0LCBUcm9uZCB3cm90ZToNCj4gPg0KPiA+PiBPbiBNb24sIDIwMTItMDctMDIg
YXQgMTY6MzUgLTA0MDAsIENodWNrIExldmVyIHdyb3RlOg0KPiA+Pj4gT24gSnVsIDIsIDIwMTIs
IGF0IDQ6MjIgUE0sIENoYXJsZXMgJ0JveW8gd3JvdGU6DQo+ID4+Pg0KPiA+Pj4+IE9uIE1vbiwg
SnVsIDIsIDIwMTIgYXQgMzowOSBQTSwgQ2h1Y2sgTGV2ZXIgPGNodWNrLmxldmVyQG9yYWNsZS5j
b20+IHdyb3RlOg0KPiA+Pj4+Pg0KPiA+Pj4+PiBVc3VhbGx5IHdlIHNlZSB0aGlzIGJlaGF2aW9y
IGJlY2F1c2Ugb2YgYSByYWNlIGJldHdlZW4gYW4gT1BFTiB3aXRoIGRlbGVnYXRpb24gYW5kIGEg
ZGVsZWdhdGlvbiByZWNhbGwuICBJbiB0aGlzIGNhc2UsIGhvd2V2ZXIsIHRoZSBjbGllbnQgaXMg
YWN0aXZlbHkgcmV0dXJuaW5nIGEgUkVBRA0KPiA+Pj4+PiBkZWxlZ2F0aW9uLCB0aGVuIHByb2Nl
ZWRpbmcgdG8gdXNlIGl0IGFueXdheS4gIEkgZG9uJ3Qgc2VlIHRoZSBzZXJ2ZXIncyByZWNhbGwg
Y2FsbGJhY2ssIHRob3VnaCwgYW5kIHRoZXJlIGFyZSBvdGhlciBpbmRpY2F0aW9ucyB0aGF0IHRo
aXMgdHJhY2UgaXMgbm90IGNvbXBsZXRlLiBTbyBpdCdzIGhhcmQNCj4gPj4+Pj4gdG8gYmUgMTAw
JSBjb25maWRlbnQuDQo+ID4+Pj4+DQo+ID4+Pj4gVGhlIHRyYWNlIGlzIG5vdCBjb21wbGV0ZSwg
aXQgaW5jbHVkZXMganVzdCBlbm91Z2ggaW5mb3JtYXRpb24gdG8NCj4gPj4+PiBleHBsYWluIHRo
ZSBwcm9ibGVtLg0KPiA+Pj4+IEhvd2V2ZXIgSSBjYW4gY29uZmlybSB0aGUgc2VydmljZSBkaWQg
bm90IHNlbmQgYSByZWNhbGwgY2FsbGJhY2ssIHRoZQ0KPiA+Pj4+IGNsaWVudCByZXR1cm5lZCB0
aGUgZGVsZWdhdGlvbiBvZiBpdHMgb3duICJmcmVlIHdpbGwiLg0KPiA+Pj4NCj4gPj4+IFRoZSBj
YWxsYmFjayB3b3VsZCBjb21lIG9uIGEgc2VwYXJhdGUgVENQIGNvbm5lY3Rpb24uICBJIGNhbid0
IHRoaW5rIG9mIGEgcmVhc29uIHRoYXQgYSBjbGllbnQgd291bGQgcmV0dXJuIGEgZGVsZWdhdGlv
biBieSBpdHNlbGYgYW5kIHRoZW4gc3Vic2VxdWVudGx5IHN0YXJ0IHRvIHVzZSBpdC4NCj4gPj4N
Cj4gPj4gSSBjYW46IHRoZXJlIGFyZSBhIG51bWJlciBvZiBzZXJ2ZXJzIG91dCB0aGVyZSB0aGF0
IHZpb2xhdGUgdGhlIHNwZWMgYnkNCj4gPj4gcmV0dXJuaW5nIGEgZGVsZWdhdGlvbiBhcyBwYXJ0
IG9mIGFuIE9QRU4oQ0xBSU1fREVMRUdBVEVfQ1VSKS4gVXN1YWxseQ0KPiA+PiB0aG9zZSBicm9r
ZW4gc2VydmVycyB3aWxsIHNlbmQgdGhlIGV4YWN0IHNhbWUgc3RhdGVpZCBhcyB0aGUgZGVsZWdh
dGlvbg0KPiA+PiB0aGF0IGlzIGJlaW5nIHJldHVybmVkLg0KPiA+DQo+ID4gVGhlIE9QRU4gaW4g
ZnJhbWUgNyBpcyBhIENMQUlNX05VTEwgT1BFTiwgaXNuJ3QgaXQ/DQo+ID4NCj4gVGhlIE9QRU4g
aW4gdGhpcyBjYXNlIGlzIGEgQ0xBSU1fTlVMTCBhbmQgSSBoYXZlIHJlLWV4YW1pbmVkIG15DQo+
IG5ldHdvcmsgZHVtcCwgdGhlcmUgd2FzIG5vIGNhbGwgYmFjayBmcm9tIHRoZSBzZXJ2ZXIuDQo+
IFNvIHdoeSB3b3VsZCB0aGUgY2xpZW50IHJldHVybnMgYSBkZWxlZ2F0aW9uIHZvbHVudGFyaWx5
IGFuZCB0aGVuIHJlLXVzZSBpdD8NCj4gDQo+ID4+Pj4gSXMgaXQgcG9zc2libGUgaXMgYSBzY2hl
ZHVsaW5nIGlzc3VlIG9mIHNvbWUgc29ydCwgd2hlcmUgdGhlIFJFQUQNCj4gPj4+PiBzaG91bGQg
aGF2ZSBiZWVuIHNlbnQgYWhlYWQgb2YgdGhlIERFTEVHUkVUVVJOIGJ1dCBzb21laG93IGdvdCBt
aXhlZA0KPiA+Pj4+IHVwPw0KPiANCj4gPj4+IE9yIHBvc3NpYmx5IHRoYXQgdGhlIERFTEVHUkVU
VVJOIGRvZXNuJ3QgYWN0dWFsbHkgcmVtb3ZlIHRoZSBkZWxlZ2F0aW9uIHN0YXRlIElEIHVudGls
IHRoZSBzZXJ2ZXIgaGFzIHJlcGxpZWQsIGFuZCB0aGUgUkVBRCByZXF1ZXN0IHdhcyBzZW50IGJl
Zm9yZSB0aGUgREVMRUdSRVRVUk4NCj4gPj4+IHJlcGx5IGFycml2ZWQgYXQgdGhlIGNsaWVudC4N
Cj4gDQo+IEluZGVlZCwgdGhlIFJFQUQgd2FzIGlzc3VlZCBhZnRlciB0aGUgREVMRUdSRVRVUk4g
YnV0IGJlZm9yZSB0aGUNCj4gcmVzcG9uc2UgdG8gaXQuIElzIGl0IHBvc3NpYmxlIHRvIGNoZWNr
IGlmIHRoaXMgaXMgZXhwZWN0ZWQgYmVoYXZpb3I/DQoNClRoYXQgaXMgZXhwZWN0ZWQgYmVoYXZp
b3VyLiBXZSBkb24ndCBzZXJpYWxpc2UgUkVBRCBhbmQgREVMRUdSRVRVUk4uDQoNCldoYXQgaXNu
J3QgZXhwZWN0ZWQgYmVoYXZpb3VyIGlzIGZvciB0aGUgY2xpZW50IHRvIERFTEVHUkVUVVJODQpp
bW1lZGlhdGVseSB1cG9uIHJlY2VpdmluZyB0aGUgZGVsZWdhdGlvbi4gTm9yIGlzIGl0IGV4cGVj
dGVkIHRoYXQgdGhlDQpSRUFEIHdvdWxkIGZhaWwgdG8gcmVjb3ZlciBpbiBhbiBvcmRpbmFyeSBE
RUxFR1JFVFVSTiBzaXR1YXRpb24uDQoNClRoaXMgaXMgd2h5IEknbSBpbnRlcmVzdGVkIGluIHNl
ZWluZyB3aGF0IGhhcHBlbmVkIGJlZm9yZSB0aGUgT1BFTi4NCg0KLS0gDQpUcm9uZCBNeWtsZWJ1
c3QNCkxpbnV4IE5GUyBjbGllbnQgbWFpbnRhaW5lcg0KDQpOZXRBcHANClRyb25kLk15a2xlYnVz
dEBuZXRhcHAuY29tDQp3d3cubmV0YXBwLmNvbQ0KDQo=

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Linux NFSv4 client uses returned delegation in subsequent READ resulting in hang (BAD_STATEID)
       [not found]               ` <CAPD_G3KS6unxC4qJ1YY1GM-PCbG5sXLC6-aYk6L1dq4Jd0U7MQ@mail.gmail.com>
@ 2012-07-05 19:23                 ` Charles 'Boyo
  2012-07-05 19:55                   ` Myklebust, Trond
  0 siblings, 1 reply; 13+ messages in thread
From: Charles 'Boyo @ 2012-07-05 19:23 UTC (permalink / raw)
  To: Myklebust, Trond; +Cc: Chuck Lever, linux-nfs

>> What isn't expected behaviour is for the client to DELEGRETURN
>> immediately upon receiving the delegation. Nor is it expected that the
>> READ would fail to recover in an ordinary DELEGRETURN situation.
>>
>> This is why I'm interested in seeing what happened before the OPEN.
>>
>
> Attached is a minimally redacted trace which contains all the NFS
> calls and responses before and through this particular DELEGRETURN
> (frame 6084).
> I am unable to provide the full packet capture due to the sensitivity
> of the data contained therein.

Trond, did the attached trace confirm the issue as suspected?

Charles

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Linux NFSv4 client uses returned delegation in subsequent READ resulting in hang (BAD_STATEID)
  2012-07-05 19:23                 ` Charles 'Boyo
@ 2012-07-05 19:55                   ` Myklebust, Trond
  2012-07-05 21:35                     ` Chuck Lever
  0 siblings, 1 reply; 13+ messages in thread
From: Myklebust, Trond @ 2012-07-05 19:55 UTC (permalink / raw)
  To: Charles 'Boyo; +Cc: Chuck Lever, linux-nfs

T24gVGh1LCAyMDEyLTA3LTA1IGF0IDIwOjIzICswMTAwLCBDaGFybGVzICdCb3lvIHdyb3RlOg0K
PiA+PiBXaGF0IGlzbid0IGV4cGVjdGVkIGJlaGF2aW91ciBpcyBmb3IgdGhlIGNsaWVudCB0byBE
RUxFR1JFVFVSTg0KPiA+PiBpbW1lZGlhdGVseSB1cG9uIHJlY2VpdmluZyB0aGUgZGVsZWdhdGlv
bi4gTm9yIGlzIGl0IGV4cGVjdGVkIHRoYXQgdGhlDQo+ID4+IFJFQUQgd291bGQgZmFpbCB0byBy
ZWNvdmVyIGluIGFuIG9yZGluYXJ5IERFTEVHUkVUVVJOIHNpdHVhdGlvbi4NCj4gPj4NCj4gPj4g
VGhpcyBpcyB3aHkgSSdtIGludGVyZXN0ZWQgaW4gc2VlaW5nIHdoYXQgaGFwcGVuZWQgYmVmb3Jl
IHRoZSBPUEVOLg0KPiA+Pg0KPiA+DQo+ID4gQXR0YWNoZWQgaXMgYSBtaW5pbWFsbHkgcmVkYWN0
ZWQgdHJhY2Ugd2hpY2ggY29udGFpbnMgYWxsIHRoZSBORlMNCj4gPiBjYWxscyBhbmQgcmVzcG9u
c2VzIGJlZm9yZSBhbmQgdGhyb3VnaCB0aGlzIHBhcnRpY3VsYXIgREVMRUdSRVRVUk4NCj4gPiAo
ZnJhbWUgNjA4NCkuDQo+ID4gSSBhbSB1bmFibGUgdG8gcHJvdmlkZSB0aGUgZnVsbCBwYWNrZXQg
Y2FwdHVyZSBkdWUgdG8gdGhlIHNlbnNpdGl2aXR5DQo+ID4gb2YgdGhlIGRhdGEgY29udGFpbmVk
IHRoZXJlaW4uDQo+IA0KPiBUcm9uZCwgZGlkIHRoZSBhdHRhY2hlZCB0cmFjZSBjb25maXJtIHRo
ZSBpc3N1ZSBhcyBzdXNwZWN0ZWQ/DQo+IA0KPiBDaGFybGVzDQoNCk5vLiBXaGF0IGlzIGhhcHBl
bmluZyBpcyB0aGF0IHRoZSBjbGllbnQgZG9lcyBhbiBPUEVOIG9mIGZpbGUNCiJhY2NvdW50Lmlu
Zm8iLiBJdCB0aGVuIGNsb3NlcyB0aGUgZmlsZSBhbmQgZm9yIHNvbWUgcmVhc29uIHRoYXQgSSBk
b24ndA0KeWV0IHVuZGVyc3RhbmQsIGl0IGRlY2lkZXMgdG8gcmV0dXJuIHRoZSBkZWxlZ2F0aW9u
Lg0KDQpUaGVuIHRoZSBhcHBsaWNhdGlvbiBkb2VzIGFub3RoZXIgT1BFTiwgd2hpY2ggcmFjZXMg
d2l0aCB0aGUgZGVsZWdhdGlvbg0KcmV0dXJuLiBCZWNhdXNlIG9mIHRoZSByYWNlLCB0aGUgc2Vy
dmVyIHJldHVybnMgdGhlIHNhbWUgZGVsZWdhdGlvbiBhcw0KaXQgZGlkIGluIHRoZSBmaXJzdCBv
cGVuIGNhbGwuIFRoZSBjbGllbnQgKHdoaWNoIGhhcyBub3cgZm9yZ290dGVuIGFib3V0DQp0aGUg
b3JpZ2luYWwgb3Blbikgc2VlcyB3aGF0IGl0IHRoaW5rcyBpcyBhIG5ldyBkZWxlZ2F0aW9uLCBh
bmQgc28gaXQNCmFkb3B0cyBpdC4NCg0KU28gdGhpcyBleHBsYWlucyB3aHkgdGhlIFJFQURzIGVu
ZCB1cCB0cnlpbmcgdG8gdXNlIGEgcmV0dXJuZWQNCmRlbGVnYXRpb24uIFdoYXQgcmVtYWlucyB0
byBiZSBleHBsYWluZWQgaXM6DQoNCmEpIFdoeSBkaWQgdGhlIGNsaWVudCByZXR1cm4gdGhlIGRl
bGVnYXRpb24gaW4gdGhlIGZpcnN0IHBsYWNlPyBJcyBpdCBpbg0Kc29tZSBleHRyZW1lIG1lbW9y
eSBwcmVzc3VyZSBzaXR1YXRpb24sIG9yIGRvZXMgaXQgcGVyaGFwcyBoYXZlIHNvbWUNCmZ1bmt5
IHNldHRpbmcgZm9yIC9wcm9jL3N5cy92bS92ZnNfY2FjaGVfcHJlc3N1cmU/DQoNCmIpIFdoeSBk
b2Vzbid0IHRoZSByZWNvdmVyeSBzY2VuYXJpbyBraWNrIGluIGFuZCBkbyB0aGUgcmlnaHQgdGhp
bmc/DQoNCi0tIA0KVHJvbmQgTXlrbGVidXN0DQpMaW51eCBORlMgY2xpZW50IG1haW50YWluZXIN
Cg0KTmV0QXBwDQpUcm9uZC5NeWtsZWJ1c3RAbmV0YXBwLmNvbQ0Kd3d3Lm5ldGFwcC5jb20NCg0K

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Linux NFSv4 client uses returned delegation in subsequent READ resulting in hang (BAD_STATEID)
  2012-07-05 19:55                   ` Myklebust, Trond
@ 2012-07-05 21:35                     ` Chuck Lever
  0 siblings, 0 replies; 13+ messages in thread
From: Chuck Lever @ 2012-07-05 21:35 UTC (permalink / raw)
  To: Myklebust, Trond; +Cc: Charles 'Boyo, linux-nfs


On Jul 5, 2012, at 3:55 PM, Myklebust, Trond wrote:

> On Thu, 2012-07-05 at 20:23 +0100, Charles 'Boyo wrote:
>>>> What isn't expected behaviour is for the client to DELEGRETURN
>>>> immediately upon receiving the delegation. Nor is it expected that the
>>>> READ would fail to recover in an ordinary DELEGRETURN situation.
>>>> 
>>>> This is why I'm interested in seeing what happened before the OPEN.
>>>> 
>>> 
>>> Attached is a minimally redacted trace which contains all the NFS
>>> calls and responses before and through this particular DELEGRETURN
>>> (frame 6084).
>>> I am unable to provide the full packet capture due to the sensitivity
>>> of the data contained therein.
>> 
>> Trond, did the attached trace confirm the issue as suspected?
>> 
>> Charles
> 
> No. What is happening is that the client does an OPEN of file
> "account.info". It then closes the file and for some reason that I don't
> yet understand, it decides to return the delegation.
> 
> Then the application does another OPEN, which races with the delegation
> return. Because of the race, the server returns the same delegation as
> it did in the first open call. The client (which has now forgotten about
> the original open) sees what it thinks is a new delegation, and so it
> adopts it.
> 
> So this explains why the READs end up trying to use a returned
> delegation. What remains to be explained is:
> 
> a) Why did the client return the delegation in the first place? Is it in
> some extreme memory pressure situation, or does it perhaps have some
> funky setting for /proc/sys/vm/vfs_cache_pressure?
> 
> b) Why doesn't the recovery scenario kick in and do the right thing?

My guess is that EL6.2 kernels don't have the mainline patch that adds support for recovering a single bad state ID.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2012-07-05 21:35 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-01  1:53 Linux NFSv4 client uses returned delegation in subsequent READ resulting in hang (BAD_STATEID) Charles 'Boyo
2012-07-02 14:09 ` Chuck Lever
2012-07-02 20:22   ` Charles 'Boyo
2012-07-02 20:35     ` Chuck Lever
2012-07-02 21:13       ` Myklebust, Trond
2012-07-02 21:19         ` Chuck Lever
2012-07-02 21:48           ` Myklebust, Trond
2012-07-02 21:50             ` Chuck Lever
2012-07-02 21:50           ` Charles 'Boyo
2012-07-02 21:59             ` Myklebust, Trond
     [not found]               ` <CAPD_G3KS6unxC4qJ1YY1GM-PCbG5sXLC6-aYk6L1dq4Jd0U7MQ@mail.gmail.com>
2012-07-05 19:23                 ` Charles 'Boyo
2012-07-05 19:55                   ` Myklebust, Trond
2012-07-05 21:35                     ` Chuck Lever

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.