All of lore.kernel.org
 help / color / mirror / Atom feed
* I can't get no readdir satisfaction
@ 2016-08-23 15:09 Benjamin Coddington
  2016-08-23 15:36 ` Trond Myklebust
  2016-08-23 19:36 ` J. Bruce Fields
  0 siblings, 2 replies; 19+ messages in thread
From: Benjamin Coddington @ 2016-08-23 15:09 UTC (permalink / raw)
  To: List Linux NFS Mailing

Hi linux-nfs,

311324ad1713 ("NFS: Be more aggressive in using readdirplus for 'ls -l'
situations") changed when nfs_readdir() decides to revalidate the
directory's mapping, which contains all the entries.  In addition to just
checking if the attribute cache has expired, it includes a check to see if
NFS_INO_INVALID_DATA is set on the directory.

Well, customers that have directories with very many dentries and that same
directory's attributes are frequently updated are now grouchy that `ls -l`
takes so long since any update of the directory causes the mapping to be
invalidated and we have to start over filling the directory's mapping.

I actually haven't put real hard thought into it yet (because often for me
that just wastes a lot of time), so I am doing the lazy thing by asking this
question:

Can we go back to just the using the attribute cache timeout, or should we
get all heuristical about readdir?

Ben

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: I can't get no readdir satisfaction
  2016-08-23 15:09 I can't get no readdir satisfaction Benjamin Coddington
@ 2016-08-23 15:36 ` Trond Myklebust
  2016-08-23 21:21   ` Benjamin Coddington
  2016-08-23 19:36 ` J. Bruce Fields
  1 sibling, 1 reply; 19+ messages in thread
From: Trond Myklebust @ 2016-08-23 15:36 UTC (permalink / raw)
  To: Coddington Benjamin; +Cc: List Linux NFS Mailing


> On Aug 23, 2016, at 11:09, Benjamin Coddington <bcodding@redhat.com> wrot=
e:
>=20
> Hi linux-nfs,
>=20
> 311324ad1713 ("NFS: Be more aggressive in using readdirplus for 'ls -l'
> situations") changed when nfs_readdir() decides to revalidate the
> directory's mapping, which contains all the entries.  In addition to just
> checking if the attribute cache has expired, it includes a check to see i=
f
> NFS_INO_INVALID_DATA is set on the directory.
>=20
> Well, customers that have directories with very many dentries and that sa=
me
> directory's attributes are frequently updated are now grouchy that `ls -l=
`
> takes so long since any update of the directory causes the mapping to be
> invalidated and we have to start over filling the directory's mapping.
>=20
> I actually haven't put real hard thought into it yet (because often for m=
e
> that just wastes a lot of time), so I am doing the lazy thing by asking t=
his
> question:
>=20
> Can we go back to just the using the attribute cache timeout, or should w=
e
> get all heuristical about readdir?
>=20

We are all heuristical at this point. How are the heuristics failing?

The original problem those heuristics were designed to solve was that all t=
he stat() calls took forever to complete, since they are all synchronous; T=
igran showed some very convincing numbers for a large directory where the d=
ifference in performance was an order of magnitude improved by using readdi=
rplus instead of readdir=85

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: I can't get no readdir satisfaction
  2016-08-23 15:09 I can't get no readdir satisfaction Benjamin Coddington
  2016-08-23 15:36 ` Trond Myklebust
@ 2016-08-23 19:36 ` J. Bruce Fields
  2016-08-23 21:50   ` Benjamin Coddington
  1 sibling, 1 reply; 19+ messages in thread
From: J. Bruce Fields @ 2016-08-23 19:36 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: List Linux NFS Mailing

On Tue, Aug 23, 2016 at 11:09:37AM -0400, Benjamin Coddington wrote:
> Hi linux-nfs,
> 
> 311324ad1713 ("NFS: Be more aggressive in using readdirplus for 'ls -l'
> situations") changed when nfs_readdir() decides to revalidate the
> directory's mapping, which contains all the entries.  In addition to just
> checking if the attribute cache has expired, it includes a check to see if
> NFS_INO_INVALID_DATA is set on the directory.
> 
> Well, customers that have directories with very many dentries and that same
> directory's attributes are frequently updated are now grouchy that `ls -l`
> takes so long since any update of the directory causes the mapping to be
> invalidated and we have to start over filling the directory's mapping.

Apologies, I don't understand the client's readdir implementation.  So
it really zeroes out the cookie every time it invalidates the directory
cache?

I also seem to remember it makes up its own cookies to return to users
instead of returning the server's.  Is the cookie invalidation a
consequence of that?  I don't think it should have to be.  And as long
as these two things (cache and cookie validity) are tied together, I
can't see how we're going to guarantee readdir progress.

--b.

> 
> I actually haven't put real hard thought into it yet (because often for me
> that just wastes a lot of time), so I am doing the lazy thing by asking this
> question:
> 
> Can we go back to just the using the attribute cache timeout, or should we
> get all heuristical about readdir?
> 
> Ben
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: I can't get no readdir satisfaction
  2016-08-23 15:36 ` Trond Myklebust
@ 2016-08-23 21:21   ` Benjamin Coddington
  2016-08-24 12:18     ` Trond Myklebust
  2016-08-24 13:02     ` Benjamin Coddington
  0 siblings, 2 replies; 19+ messages in thread
From: Benjamin Coddington @ 2016-08-23 21:21 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: List Linux NFS Mailing



On 23 Aug 2016, at 11:36, Trond Myklebust wrote:

>> On Aug 23, 2016, at 11:09, Benjamin Coddington <bcodding@redhat.com> 
>> wrote:
>>
>> Hi linux-nfs,
>>
>> 311324ad1713 ("NFS: Be more aggressive in using readdirplus for 'ls 
>> -l'
>> situations") changed when nfs_readdir() decides to revalidate the
>> directory's mapping, which contains all the entries.  In addition to 
>> just
>> checking if the attribute cache has expired, it includes a check to 
>> see if
>> NFS_INO_INVALID_DATA is set on the directory.
>>
>> Well, customers that have directories with very many dentries and 
>> that same
>> directory's attributes are frequently updated are now grouchy that 
>> `ls -l`
>> takes so long since any update of the directory causes the mapping to 
>> be
>> invalidated and we have to start over filling the directory's 
>> mapping.
>>
>> I actually haven't put real hard thought into it yet (because often 
>> for me
>> that just wastes a lot of time), so I am doing the lazy thing by 
>> asking this
>> question:
>>
>> Can we go back to just the using the attribute cache timeout, or 
>> should we
>> get all heuristical about readdir?
>>
>
> We are all heuristical at this point. How are the heuristics failing?
>
> The original problem those heuristics were designed to solve was that 
> all
> the stat() calls took forever to complete, since they are all 
> synchronous;
> Tigran showed some very convincing numbers for a large directory where 
> the
> difference in performance was an order of magnitude improved by using
> readdirplus instead of readdir…

I'll try to present a better explanation.  While `ls -l` is walking 
through
a directory repeatedly entering nfs_readdir(), a CREATE response send us
through nfs_post_op_update_inode_locked():

1531     if (S_ISDIR(inode->i_mode))
1532         invalid |= NFS_INO_INVALID_DATA;
1533     nfs_set_cache_invalid(inode, invalid);

Now, the next entry into nfs_readdir() has us do 
nfs_revalidate_mapping(),
which will do nfs_invalidate_mapping() for the directory, and so we have 
to
start over with cookie 0 sending READDIRPLUS to re-fill the directory's
mapping to get back to where we are for the current nfs_readdir().

This process repeats for every entry into nfs_readdir() if the directory
keeps getting updated, and it becomes more likely that it will be 
updated as
each pass takes longer and longer to re-fill the mapping as the current
nfs_readdir() invocation is further along.

READDIRPLUS isn't the problem, the problem is invalidating the directory
mapping in the middle of a series of getdents() if we do a CREATE.  
Also, I
think a similar thing happens if the directory's mtime or ctime is 
updated -
but in that case we set NFS_INO_INVALID_DATA because the change_attr
updates.

So, for directories with a large number of entries that updates often, 
it can
be very slow to list the directory.

Why did 311324ad1713 change nfs_readdir from

if (nfs_attribute_cache_expired(inode))
     nfs_revalidate_mapping(inode, file->f_mapping);
to

if (nfs_attribute_cache_expired(inode) || nfsi->cache_validity & 
NFS_INO_INVALID_DATA)
     nfs_revalidate_mapping(inode, file->f_mapping);

.. and can we go back to the way it was before?

OK.. I understand why -- it is more correct since if we know the 
directory has
changed, we might as well fetch the change.  Otherwise, we might be 
creating
files and then wondering why they aren't listed.

It might be nicer to not invalidate the mapping we're currently using 
for
readdir, though.  Maybe there's a way to keep the mapping for the 
currently
opened directory and invalidate it once it's closed.

I'm using v3 for all this, but I think the problem is the same for v4.

Ben

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: I can't get no readdir satisfaction
  2016-08-23 19:36 ` J. Bruce Fields
@ 2016-08-23 21:50   ` Benjamin Coddington
  0 siblings, 0 replies; 19+ messages in thread
From: Benjamin Coddington @ 2016-08-23 21:50 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: List Linux NFS Mailing

On 23 Aug 2016, at 15:36, J. Bruce Fields wrote:

> On Tue, Aug 23, 2016 at 11:09:37AM -0400, Benjamin Coddington wrote:
>> Hi linux-nfs,
>>
>> 311324ad1713 ("NFS: Be more aggressive in using readdirplus for 'ls 
>> -l'
>> situations") changed when nfs_readdir() decides to revalidate the
>> directory's mapping, which contains all the entries.  In addition to 
>> just
>> checking if the attribute cache has expired, it includes a check to 
>> see if
>> NFS_INO_INVALID_DATA is set on the directory.
>>
>> Well, customers that have directories with very many dentries and 
>> that same
>> directory's attributes are frequently updated are now grouchy that 
>> `ls -l`
>> takes so long since any update of the directory causes the mapping to 
>> be
>> invalidated and we have to start over filling the directory's 
>> mapping.
>
> Apologies, I don't understand the client's readdir implementation.  So
> it really zeroes out the cookie every time it invalidates the 
> directory
> cache?

Yes, it zeros the cookieverf, so as I understand it we have to start 
over.
But, maybe for a given open() of a directory we can just keep the same
cookieverf, even if we invalidate the mapping we can continue to return
entries where the last nfs_readdir() left off..

I don't fully understand the client's readdir implementation either..

> I also seem to remember it makes up its own cookies to return to users
> instead of returning the server's.  Is the cookie invalidation a
> consequence of that?  I don't think it should have to be.  And as long
> as these two things (cache and cookie validity) are tied together, I
> can't see how we're going to guarantee readdir progress.

For users, it looks like a "position" is used, which is like an index 
into the
entries.  After READDIR, each entry's cookie is stored in the pagecache 
in
an array indexed by this position.  So in order to resume reading the
directory at a position, we need retrieve all the prior entries from the
server to determine what cookie to start from.

.. that made sense to me.  I hope I am understanding this correctly.

Ben

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: I can't get no readdir satisfaction
  2016-08-23 21:21   ` Benjamin Coddington
@ 2016-08-24 12:18     ` Trond Myklebust
  2016-08-24 13:15       ` Benjamin Coddington
  2016-08-24 13:56       ` J. Bruce Fields
  2016-08-24 13:02     ` Benjamin Coddington
  1 sibling, 2 replies; 19+ messages in thread
From: Trond Myklebust @ 2016-08-24 12:18 UTC (permalink / raw)
  To: Coddington Benjamin; +Cc: List Linux NFS Mailing

DQo+IE9uIEF1ZyAyMywgMjAxNiwgYXQgMTc6MjEsIEJlbmphbWluIENvZGRpbmd0b24gPGJjb2Rk
aW5nQHJlZGhhdC5jb20+IHdyb3RlOg0KPiANCj4gDQo+IA0KPiBPbiAyMyBBdWcgMjAxNiwgYXQg
MTE6MzYsIFRyb25kIE15a2xlYnVzdCB3cm90ZToNCj4gDQo+Pj4gT24gQXVnIDIzLCAyMDE2LCBh
dCAxMTowOSwgQmVuamFtaW4gQ29kZGluZ3RvbiA8YmNvZGRpbmdAcmVkaGF0LmNvbT4gd3JvdGU6
DQo+Pj4gDQo+Pj4gSGkgbGludXgtbmZzLA0KPj4+IA0KPj4+IDMxMTMyNGFkMTcxMyAoIk5GUzog
QmUgbW9yZSBhZ2dyZXNzaXZlIGluIHVzaW5nIHJlYWRkaXJwbHVzIGZvciAnbHMgLWwnDQo+Pj4g
c2l0dWF0aW9ucyIpIGNoYW5nZWQgd2hlbiBuZnNfcmVhZGRpcigpIGRlY2lkZXMgdG8gcmV2YWxp
ZGF0ZSB0aGUNCj4+PiBkaXJlY3RvcnkncyBtYXBwaW5nLCB3aGljaCBjb250YWlucyBhbGwgdGhl
IGVudHJpZXMuICBJbiBhZGRpdGlvbiB0byBqdXN0DQo+Pj4gY2hlY2tpbmcgaWYgdGhlIGF0dHJp
YnV0ZSBjYWNoZSBoYXMgZXhwaXJlZCwgaXQgaW5jbHVkZXMgYSBjaGVjayB0byBzZWUgaWYNCj4+
PiBORlNfSU5PX0lOVkFMSURfREFUQSBpcyBzZXQgb24gdGhlIGRpcmVjdG9yeS4NCj4+PiANCj4+
PiBXZWxsLCBjdXN0b21lcnMgdGhhdCBoYXZlIGRpcmVjdG9yaWVzIHdpdGggdmVyeSBtYW55IGRl
bnRyaWVzIGFuZCB0aGF0IHNhbWUNCj4+PiBkaXJlY3RvcnkncyBhdHRyaWJ1dGVzIGFyZSBmcmVx
dWVudGx5IHVwZGF0ZWQgYXJlIG5vdyBncm91Y2h5IHRoYXQgYGxzIC1sYA0KPj4+IHRha2VzIHNv
IGxvbmcgc2luY2UgYW55IHVwZGF0ZSBvZiB0aGUgZGlyZWN0b3J5IGNhdXNlcyB0aGUgbWFwcGlu
ZyB0byBiZQ0KPj4+IGludmFsaWRhdGVkIGFuZCB3ZSBoYXZlIHRvIHN0YXJ0IG92ZXIgZmlsbGlu
ZyB0aGUgZGlyZWN0b3J5J3MgbWFwcGluZy4NCj4+PiANCj4+PiBJIGFjdHVhbGx5IGhhdmVuJ3Qg
cHV0IHJlYWwgaGFyZCB0aG91Z2h0IGludG8gaXQgeWV0IChiZWNhdXNlIG9mdGVuIGZvciBtZQ0K
Pj4+IHRoYXQganVzdCB3YXN0ZXMgYSBsb3Qgb2YgdGltZSksIHNvIEkgYW0gZG9pbmcgdGhlIGxh
enkgdGhpbmcgYnkgYXNraW5nIHRoaXMNCj4+PiBxdWVzdGlvbjoNCj4+PiANCj4+PiBDYW4gd2Ug
Z28gYmFjayB0byBqdXN0IHRoZSB1c2luZyB0aGUgYXR0cmlidXRlIGNhY2hlIHRpbWVvdXQsIG9y
IHNob3VsZCB3ZQ0KPj4+IGdldCBhbGwgaGV1cmlzdGljYWwgYWJvdXQgcmVhZGRpcj8NCj4+PiAN
Cj4+IA0KPj4gV2UgYXJlIGFsbCBoZXVyaXN0aWNhbCBhdCB0aGlzIHBvaW50LiBIb3cgYXJlIHRo
ZSBoZXVyaXN0aWNzIGZhaWxpbmc/DQo+PiANCj4+IFRoZSBvcmlnaW5hbCBwcm9ibGVtIHRob3Nl
IGhldXJpc3RpY3Mgd2VyZSBkZXNpZ25lZCB0byBzb2x2ZSB3YXMgdGhhdCBhbGwNCj4+IHRoZSBz
dGF0KCkgY2FsbHMgdG9vayBmb3JldmVyIHRvIGNvbXBsZXRlLCBzaW5jZSB0aGV5IGFyZSBhbGwg
c3luY2hyb25vdXM7DQo+PiBUaWdyYW4gc2hvd2VkIHNvbWUgdmVyeSBjb252aW5jaW5nIG51bWJl
cnMgZm9yIGEgbGFyZ2UgZGlyZWN0b3J5IHdoZXJlIHRoZQ0KPj4gZGlmZmVyZW5jZSBpbiBwZXJm
b3JtYW5jZSB3YXMgYW4gb3JkZXIgb2YgbWFnbml0dWRlIGltcHJvdmVkIGJ5IHVzaW5nDQo+PiBy
ZWFkZGlycGx1cyBpbnN0ZWFkIG9mIHJlYWRkaXLigKYNCj4gDQo+IEknbGwgdHJ5IHRvIHByZXNl
bnQgYSBiZXR0ZXIgZXhwbGFuYXRpb24uICBXaGlsZSBgbHMgLWxgIGlzIHdhbGtpbmcgdGhyb3Vn
aA0KPiBhIGRpcmVjdG9yeSByZXBlYXRlZGx5IGVudGVyaW5nIG5mc19yZWFkZGlyKCksIGEgQ1JF
QVRFIHJlc3BvbnNlIHNlbmQgdXMNCj4gdGhyb3VnaCBuZnNfcG9zdF9vcF91cGRhdGVfaW5vZGVf
bG9ja2VkKCk6DQo+IA0KPiAxNTMxICAgICBpZiAoU19JU0RJUihpbm9kZS0+aV9tb2RlKSkNCj4g
MTUzMiAgICAgICAgIGludmFsaWQgfD0gTkZTX0lOT19JTlZBTElEX0RBVEE7DQo+IDE1MzMgICAg
IG5mc19zZXRfY2FjaGVfaW52YWxpZChpbm9kZSwgaW52YWxpZCk7DQo+IA0KPiBOb3csIHRoZSBu
ZXh0IGVudHJ5IGludG8gbmZzX3JlYWRkaXIoKSBoYXMgdXMgZG8gbmZzX3JldmFsaWRhdGVfbWFw
cGluZygpLA0KPiB3aGljaCB3aWxsIGRvIG5mc19pbnZhbGlkYXRlX21hcHBpbmcoKSBmb3IgdGhl
IGRpcmVjdG9yeSwgYW5kIHNvIHdlIGhhdmUgdG8NCj4gc3RhcnQgb3ZlciB3aXRoIGNvb2tpZSAw
IHNlbmRpbmcgUkVBRERJUlBMVVMgdG8gcmUtZmlsbCB0aGUgZGlyZWN0b3J5J3MNCj4gbWFwcGlu
ZyB0byBnZXQgYmFjayB0byB3aGVyZSB3ZSBhcmUgZm9yIHRoZSBjdXJyZW50IG5mc19yZWFkZGly
KCkuDQo+IA0KPiBUaGlzIHByb2Nlc3MgcmVwZWF0cyBmb3IgZXZlcnkgZW50cnkgaW50byBuZnNf
cmVhZGRpcigpIGlmIHRoZSBkaXJlY3RvcnkNCj4ga2VlcHMgZ2V0dGluZyB1cGRhdGVkLCBhbmQg
aXQgYmVjb21lcyBtb3JlIGxpa2VseSB0aGF0IGl0IHdpbGwgYmUgdXBkYXRlZCBhcw0KPiBlYWNo
IHBhc3MgdGFrZXMgbG9uZ2VyIGFuZCBsb25nZXIgdG8gcmUtZmlsbCB0aGUgbWFwcGluZyBhcyB0
aGUgY3VycmVudA0KPiBuZnNfcmVhZGRpcigpIGludm9jYXRpb24gaXMgZnVydGhlciBhbG9uZy4N
Cj4gDQo+IFJFQURESVJQTFVTIGlzbid0IHRoZSBwcm9ibGVtLCB0aGUgcHJvYmxlbSBpcyBpbnZh
bGlkYXRpbmcgdGhlIGRpcmVjdG9yeQ0KPiBtYXBwaW5nIGluIHRoZSBtaWRkbGUgb2YgYSBzZXJp
ZXMgb2YgZ2V0ZGVudHMoKSBpZiB3ZSBkbyBhIENSRUFURS4gIEFsc28sIEkNCj4gdGhpbmsgYSBz
aW1pbGFyIHRoaW5nIGhhcHBlbnMgaWYgdGhlIGRpcmVjdG9yeSdzIG10aW1lIG9yIGN0aW1lIGlz
IHVwZGF0ZWQgLQ0KPiBidXQgaW4gdGhhdCBjYXNlIHdlIHNldCBORlNfSU5PX0lOVkFMSURfREFU
QSBiZWNhdXNlIHRoZSBjaGFuZ2VfYXR0cg0KPiB1cGRhdGVzLg0KPiANCj4gU28sIGZvciBkaXJl
Y3RvcmllcyB3aXRoIGEgbGFyZ2UgbnVtYmVyIG9mIGVudHJpZXMgdGhhdCB1cGRhdGVzIG9mdGVu
LCBpdCBjYW4NCj4gYmUgdmVyeSBzbG93IHRvIGxpc3QgdGhlIGRpcmVjdG9yeS4NCj4gDQo+IFdo
eSBkaWQgMzExMzI0YWQxNzEzIGNoYW5nZSBuZnNfcmVhZGRpciBmcm9tDQo+IA0KPiBpZiAobmZz
X2F0dHJpYnV0ZV9jYWNoZV9leHBpcmVkKGlub2RlKSkNCj4gICAgbmZzX3JldmFsaWRhdGVfbWFw
cGluZyhpbm9kZSwgZmlsZS0+Zl9tYXBwaW5nKTsNCj4gdG8NCj4gDQo+IGlmIChuZnNfYXR0cmli
dXRlX2NhY2hlX2V4cGlyZWQoaW5vZGUpIHx8IG5mc2ktPmNhY2hlX3ZhbGlkaXR5ICYgTkZTX0lO
T19JTlZBTElEX0RBVEEpDQo+ICAgIG5mc19yZXZhbGlkYXRlX21hcHBpbmcoaW5vZGUsIGZpbGUt
PmZfbWFwcGluZyk7DQoNCkFzIHRoZSBjb21taXQgbWVzc2FnZSBzYXlzLCB0aGUgd2hvbGUgcHVy
cG9zZSB3YXMgdG8gdXNlIFJFQURESVJQTFVTIGFzIGEgc3Vic3RpdHV0ZSBmb3IgbXVsdGlwbGUg
R0VUQVRUUiBjYWxscyB3aGVuIHRoZSBoZXVyaXN0aWMgdGVsbHMgdXMgdGhhdCB0aGUgdXNlciBp
cyBwZXJmb3JtaW5nIGFuIOKAmGxzIC1s4oCZIHN0eWxlIG9mIHdvcmtsb2FkLg0KDQo+IA0KPiAu
LiBhbmQgY2FuIHdlIGdvIGJhY2sgdG8gdGhlIHdheSBpdCB3YXMgYmVmb3JlPw0KDQpOb3Qgd2l0
aG91dCBzbG93aW5nIGRvd24g4oCYbHMgLWzigJkgb24gbGFyZ2UgZGlyZWN0b3JpZXMuDQoNCj4g
DQo+IE9LLi4gSSB1bmRlcnN0YW5kIHdoeSAtLSBpdCBpcyBtb3JlIGNvcnJlY3Qgc2luY2UgaWYg
d2Uga25vdyB0aGUgZGlyZWN0b3J5IGhhcw0KPiBjaGFuZ2VkLCB3ZSBtaWdodCBhcyB3ZWxsIGZl
dGNoIHRoZSBjaGFuZ2UuICBPdGhlcndpc2UsIHdlIG1pZ2h0IGJlIGNyZWF0aW5nDQo+IGZpbGVz
IGFuZCB0aGVuIHdvbmRlcmluZyB3aHkgdGhleSBhcmVuJ3QgbGlzdGVkLg0KPiANCj4gSXQgbWln
aHQgYmUgbmljZXIgdG8gbm90IGludmFsaWRhdGUgdGhlIG1hcHBpbmcgd2UncmUgY3VycmVudGx5
IHVzaW5nIGZvcg0KPiByZWFkZGlyLCB0aG91Z2guICBNYXliZSB0aGVyZSdzIGEgd2F5IHRvIGtl
ZXAgdGhlIG1hcHBpbmcgZm9yIHRoZSBjdXJyZW50bHkNCj4gb3BlbmVkIGRpcmVjdG9yeSBhbmQg
aW52YWxpZGF0ZSBpdCBvbmNlIGl0J3MgY2xvc2VkLg0KDQogUE9TSVggcmVxdWlyZXMgdGhhdCB5
b3UgcmV2YWxpZGF0ZSBvbiBvcGVuZGlyKCkgYW5kIHJld2luZGRpcigpLCBhbmQgbGVhdmVzIGJl
aGF2aW91ciB3LnIudC4gZmlsZSBhZGRpdGlvbiBhbmQgcmVtb3ZhbCBhZnRlciB0aGUgY2FsbCB0
byBvcGVuZGlyKCkvcmV3aW5kZGlyKCkgdW5kZWZpbmVkIChodHRwOi8vcHVicy5vcGVuZ3JvdXAu
b3JnL29ubGluZXB1YnMvOTY5OTkxOTc5OS9mdW5jdGlvbnMvcmVhZGRpci5odG1sKS4gSSBiZWxp
ZXZlIG1vc3QgZmlsZXN5c3RlbXMgdW5kZXIgTGludXggd2lsbCBlbnN1cmUgdGhhdCBpZiBhIGZp
bGUgaXMgcmVtb3ZlZCwgaXQgaXMgbm8gbG9uZ2VyIHNlZW4gaW4gdGhlIHJlYWRkaXIgc3RyZWFt
LCBhbmQgYSBmaWxlIHRoYXQgaXMgYWRkZWQgd2lsbCBiZSBzZWVuIHVubGVzcyB0aGUgcmVhZGRp
ciBjdXJzb3IgaGFzIGFscmVhZHkgcGFzc2VkIGl0LCBzbyB0aGVyZSBpcyB0aGF0IHRvIGNvbnNp
ZGVyLg0KDQpIb3dldmVyIGNvcnJlY3RuZXNzIHdhcyBub3QgdGhlIG1haW4gaXNzdWUgaGVyZS4N
Cg0K

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: I can't get no readdir satisfaction
  2016-08-23 21:21   ` Benjamin Coddington
  2016-08-24 12:18     ` Trond Myklebust
@ 2016-08-24 13:02     ` Benjamin Coddington
  1 sibling, 0 replies; 19+ messages in thread
From: Benjamin Coddington @ 2016-08-24 13:02 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: List Linux NFS Mailing

On 23 Aug 2016, at 17:21, Benjamin Coddington wrote:

> On 23 Aug 2016, at 11:36, Trond Myklebust wrote:
>
>>> On Aug 23, 2016, at 11:09, Benjamin Coddington <bcodding@redhat.com> 
>>> wrote:
>>>
>>> Hi linux-nfs,
>>>
>>> 311324ad1713 ("NFS: Be more aggressive in using readdirplus for 'ls 
>>> -l'
>>> situations") changed when nfs_readdir() decides to revalidate the
>>> directory's mapping, which contains all the entries.  In addition to 
>>> just
>>> checking if the attribute cache has expired, it includes a check to 
>>> see if
>>> NFS_INO_INVALID_DATA is set on the directory.
>>>
>>> Well, customers that have directories with very many dentries and 
>>> that same
>>> directory's attributes are frequently updated are now grouchy that 
>>> `ls -l`
>>> takes so long since any update of the directory causes the mapping 
>>> to be
>>> invalidated and we have to start over filling the directory's 
>>> mapping.
>>>
>>> I actually haven't put real hard thought into it yet (because often 
>>> for me
>>> that just wastes a lot of time), so I am doing the lazy thing by 
>>> asking this
>>> question:
>>>
>>> Can we go back to just the using the attribute cache timeout, or 
>>> should we
>>> get all heuristical about readdir?
>>>
>>
>> We are all heuristical at this point. How are the heuristics failing?
>>
>> The original problem those heuristics were designed to solve was that 
>> all
>> the stat() calls took forever to complete, since they are all 
>> synchronous;
>> Tigran showed some very convincing numbers for a large directory 
>> where the
>> difference in performance was an order of magnitude improved by using
>> readdirplus instead of readdir…
>
> I'll try to present a better explanation.  While `ls -l` is walking 
> through
> a directory repeatedly entering nfs_readdir(), a CREATE response send 
> us
> through nfs_post_op_update_inode_locked():
>
> 1531     if (S_ISDIR(inode->i_mode))
> 1532         invalid |= NFS_INO_INVALID_DATA;
> 1533     nfs_set_cache_invalid(inode, invalid);
>
> Now, the next entry into nfs_readdir() has us do 
> nfs_revalidate_mapping(),
> which will do nfs_invalidate_mapping() for the directory, and so we 
> have to
> start over with cookie 0 sending READDIRPLUS to re-fill the 
> directory's
> mapping to get back to where we are for the current nfs_readdir().
>
> This process repeats for every entry into nfs_readdir() if the 
> directory
> keeps getting updated, and it becomes more likely that it will be 
> updated as
> each pass takes longer and longer to re-fill the mapping as the 
> current
> nfs_readdir() invocation is further along.
>
> READDIRPLUS isn't the problem, the problem is invalidating the 
> directory
> mapping in the middle of a series of getdents() if we do a CREATE.  
> Also, I
> think a similar thing happens if the directory's mtime or ctime is 
> updated -
> but in that case we set NFS_INO_INVALID_DATA because the change_attr
> updates.
>
> So, for directories with a large number of entries that updates often, 
> it can
> be very slow to list the directory.
>
> Why did 311324ad1713 change nfs_readdir from
>
> if (nfs_attribute_cache_expired(inode))
>     nfs_revalidate_mapping(inode, file->f_mapping);
> to
>
> if (nfs_attribute_cache_expired(inode) || nfsi->cache_validity & 
> NFS_INO_INVALID_DATA)
>     nfs_revalidate_mapping(inode, file->f_mapping);
>
> .. and can we go back to the way it was before?
>
> OK.. I understand why -- it is more correct since if we know the 
> directory has
> changed, we might as well fetch the change.  Otherwise, we might be 
> creating
> files and then wondering why they aren't listed.

But that won't happen unless a process deliberately starts over at 
position
0..  I think a user creating a file, then listing the directory would 
work
as expected.

Scott Mayhew already addressed this problem in:

07b5ce8ef2d8 ("NFS: Make nfs_readdir revalidate less often")

So, from 3.10 to 3.14 we had the behavior where we wouldn't clear the
mapping unless pos = 0 or the attribute cache expired.

> It might be nicer to not invalidate the mapping we're currently using 
> for
> readdir, though.  Maybe there's a way to keep the mapping for the 
> currently
> opened directory and invalidate it once it's closed.

POSIX readdir says:

     If a file is removed from or added to the directory after the most 
recent
     call to opendir() or rewinddir(), whether a subsequent call to 
readdir()
     returns an entry for that file is unspecified.

However, RFC 1813 says:

     The client should be careful to avoid holding directory entry 
cookies
     across operations that modify the directory contents, such as 
REMOVE and
     CREATE.

Ideally, we would keep using the same cookieverf until the server sends
NFS3ERR_BAD_COOKIE, and maybe just keep track of the last position/entry
cookie so we could pick up where we left off after invalidation.

Ben

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: I can't get no readdir satisfaction
  2016-08-24 12:18     ` Trond Myklebust
@ 2016-08-24 13:15       ` Benjamin Coddington
  2016-08-24 13:39         ` Trond Myklebust
  2016-08-24 13:56       ` J. Bruce Fields
  1 sibling, 1 reply; 19+ messages in thread
From: Benjamin Coddington @ 2016-08-24 13:15 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: List Linux NFS Mailing

Apologies for the overlapping messages.. I had a mailer issue that 
caused me
to miss this message.

On 24 Aug 2016, at 8:18, Trond Myklebust wrote:
>> On Aug 23, 2016, at 17:21, Benjamin Coddington <bcodding@redhat.com> 
>> wrote:
>> So, for directories with a large number of entries that updates 
>> often, it can
>> be very slow to list the directory.
>>
>> Why did 311324ad1713 change nfs_readdir from
>>
>> if (nfs_attribute_cache_expired(inode))
>>    nfs_revalidate_mapping(inode, file->f_mapping);
>> to
>>
>> if (nfs_attribute_cache_expired(inode) || nfsi->cache_validity & 
>> NFS_INO_INVALID_DATA)
>>    nfs_revalidate_mapping(inode, file->f_mapping);
>
> As the commit message says, the whole purpose was to use READDIRPLUS 
> as a substitute for multiple GETATTR calls when the heuristic tells us 
> that the user is performing an ‘ls -l’ style of workload.
>
>>
>> .. and can we go back to the way it was before?
>
> Not without slowing down ‘ls -l’ on large directories.

Ah, so my musing is not about reverting the READDIRPLUS optimization, 
just
the bit where we revalidate the mapping if NFS_INO_INVALID_DATA.

I'll send a patch for that, let's see what happens to it!

>> OK.. I understand why -- it is more correct since if we know the 
>> directory has
>> changed, we might as well fetch the change.  Otherwise, we might be 
>> creating
>> files and then wondering why they aren't listed.
>>
>> It might be nicer to not invalidate the mapping we're currently using 
>> for
>> readdir, though.  Maybe there's a way to keep the mapping for the 
>> currently
>> opened directory and invalidate it once it's closed.
>
>  POSIX requires that you revalidate on opendir() and rewinddir(), and
>  leaves behaviour w.r.t. file addition and removal after the call to
>  opendir()/rewinddir() undefined
>  
> (http://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html).
>  I believe most filesystems under Linux will ensure that if a file is
>  removed, it is no longer seen in the readdir stream, and a file that 
> is
>  added will be seen unless the readdir cursor has already passed it, 
> so
>  there is that to consider.

In the NFS case it would be up to the server to decide if it wants to 
throw
out the in-use cookiverf with BADCOOKIE, or just include the new file at
some later entry cookie.  I think if the server wanted to insert the 
entry,
it must invalidate all currently in-use cookieverf values.

> However correctness was not the main issue here.

I think by "here" you mean the READDIRPLUS optimization, and not the 
POSIX
undefined behavior, in which case I take you to mean that the main issue 
was
performance.  That's the issue at hand for the case I described as well.
Dropping the mid-directory revalidation should keep the optimization for
your case and restore the optimization for this case.

Again, apologies for the messy mail thread.  :/

Ben

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: I can't get no readdir satisfaction
  2016-08-24 13:15       ` Benjamin Coddington
@ 2016-08-24 13:39         ` Trond Myklebust
  0 siblings, 0 replies; 19+ messages in thread
From: Trond Myklebust @ 2016-08-24 13:39 UTC (permalink / raw)
  To: Coddington Benjamin; +Cc: List Linux NFS Mailing

DQo+IE9uIEF1ZyAyNCwgMjAxNiwgYXQgMDk6MTUsIEJlbmphbWluIENvZGRpbmd0b24gPGJjb2Rk
aW5nQHJlZGhhdC5jb20+IHdyb3RlOg0KPiANCj4gQXBvbG9naWVzIGZvciB0aGUgb3ZlcmxhcHBp
bmcgbWVzc2FnZXMuLiBJIGhhZCBhIG1haWxlciBpc3N1ZSB0aGF0IGNhdXNlZCBtZQ0KPiB0byBt
aXNzIHRoaXMgbWVzc2FnZS4NCj4gDQo+IE9uIDI0IEF1ZyAyMDE2LCBhdCA4OjE4LCBUcm9uZCBN
eWtsZWJ1c3Qgd3JvdGU6DQo+Pj4gT24gQXVnIDIzLCAyMDE2LCBhdCAxNzoyMSwgQmVuamFtaW4g
Q29kZGluZ3RvbiA8YmNvZGRpbmdAcmVkaGF0LmNvbT4gd3JvdGU6DQo+Pj4gU28sIGZvciBkaXJl
Y3RvcmllcyB3aXRoIGEgbGFyZ2UgbnVtYmVyIG9mIGVudHJpZXMgdGhhdCB1cGRhdGVzIG9mdGVu
LCBpdCBjYW4NCj4+PiBiZSB2ZXJ5IHNsb3cgdG8gbGlzdCB0aGUgZGlyZWN0b3J5Lg0KPj4+IA0K
Pj4+IFdoeSBkaWQgMzExMzI0YWQxNzEzIGNoYW5nZSBuZnNfcmVhZGRpciBmcm9tDQo+Pj4gDQo+
Pj4gaWYgKG5mc19hdHRyaWJ1dGVfY2FjaGVfZXhwaXJlZChpbm9kZSkpDQo+Pj4gICBuZnNfcmV2
YWxpZGF0ZV9tYXBwaW5nKGlub2RlLCBmaWxlLT5mX21hcHBpbmcpOw0KPj4+IHRvDQo+Pj4gDQo+
Pj4gaWYgKG5mc19hdHRyaWJ1dGVfY2FjaGVfZXhwaXJlZChpbm9kZSkgfHwgbmZzaS0+Y2FjaGVf
dmFsaWRpdHkgJiBORlNfSU5PX0lOVkFMSURfREFUQSkNCj4+PiAgIG5mc19yZXZhbGlkYXRlX21h
cHBpbmcoaW5vZGUsIGZpbGUtPmZfbWFwcGluZyk7DQo+PiANCj4+IEFzIHRoZSBjb21taXQgbWVz
c2FnZSBzYXlzLCB0aGUgd2hvbGUgcHVycG9zZSB3YXMgdG8gdXNlIFJFQURESVJQTFVTIGFzIGEg
c3Vic3RpdHV0ZSBmb3IgbXVsdGlwbGUgR0VUQVRUUiBjYWxscyB3aGVuIHRoZSBoZXVyaXN0aWMg
dGVsbHMgdXMgdGhhdCB0aGUgdXNlciBpcyBwZXJmb3JtaW5nIGFuIOKAmGxzIC1s4oCZIHN0eWxl
IG9mIHdvcmtsb2FkLg0KPj4gDQo+Pj4gDQo+Pj4gLi4gYW5kIGNhbiB3ZSBnbyBiYWNrIHRvIHRo
ZSB3YXkgaXQgd2FzIGJlZm9yZT8NCj4+IA0KPj4gTm90IHdpdGhvdXQgc2xvd2luZyBkb3duIOKA
mGxzIC1s4oCZIG9uIGxhcmdlIGRpcmVjdG9yaWVzLg0KPiANCj4gQWgsIHNvIG15IG11c2luZyBp
cyBub3QgYWJvdXQgcmV2ZXJ0aW5nIHRoZSBSRUFERElSUExVUyBvcHRpbWl6YXRpb24sIGp1c3QN
Cj4gdGhlIGJpdCB3aGVyZSB3ZSByZXZhbGlkYXRlIHRoZSBtYXBwaW5nIGlmIE5GU19JTk9fSU5W
QUxJRF9EQVRBLg0KPiANCj4gSSdsbCBzZW5kIGEgcGF0Y2ggZm9yIHRoYXQsIGxldCdzIHNlZSB3
aGF0IGhhcHBlbnMgdG8gaXQhDQo+IA0KPj4+IE9LLi4gSSB1bmRlcnN0YW5kIHdoeSAtLSBpdCBp
cyBtb3JlIGNvcnJlY3Qgc2luY2UgaWYgd2Uga25vdyB0aGUgZGlyZWN0b3J5IGhhcw0KPj4+IGNo
YW5nZWQsIHdlIG1pZ2h0IGFzIHdlbGwgZmV0Y2ggdGhlIGNoYW5nZS4gIE90aGVyd2lzZSwgd2Ug
bWlnaHQgYmUgY3JlYXRpbmcNCj4+PiBmaWxlcyBhbmQgdGhlbiB3b25kZXJpbmcgd2h5IHRoZXkg
YXJlbid0IGxpc3RlZC4NCj4+PiANCj4+PiBJdCBtaWdodCBiZSBuaWNlciB0byBub3QgaW52YWxp
ZGF0ZSB0aGUgbWFwcGluZyB3ZSdyZSBjdXJyZW50bHkgdXNpbmcgZm9yDQo+Pj4gcmVhZGRpciwg
dGhvdWdoLiAgTWF5YmUgdGhlcmUncyBhIHdheSB0byBrZWVwIHRoZSBtYXBwaW5nIGZvciB0aGUg
Y3VycmVudGx5DQo+Pj4gb3BlbmVkIGRpcmVjdG9yeSBhbmQgaW52YWxpZGF0ZSBpdCBvbmNlIGl0
J3MgY2xvc2VkLg0KPj4gDQo+PiBQT1NJWCByZXF1aXJlcyB0aGF0IHlvdSByZXZhbGlkYXRlIG9u
IG9wZW5kaXIoKSBhbmQgcmV3aW5kZGlyKCksIGFuZA0KPj4gbGVhdmVzIGJlaGF2aW91ciB3LnIu
dC4gZmlsZSBhZGRpdGlvbiBhbmQgcmVtb3ZhbCBhZnRlciB0aGUgY2FsbCB0bw0KPj4gb3BlbmRp
cigpL3Jld2luZGRpcigpIHVuZGVmaW5lZA0KPj4gKGh0dHA6Ly9wdWJzLm9wZW5ncm91cC5vcmcv
b25saW5lcHVicy85Njk5OTE5Nzk5L2Z1bmN0aW9ucy9yZWFkZGlyLmh0bWwpLg0KPj4gSSBiZWxp
ZXZlIG1vc3QgZmlsZXN5c3RlbXMgdW5kZXIgTGludXggd2lsbCBlbnN1cmUgdGhhdCBpZiBhIGZp
bGUgaXMNCj4+IHJlbW92ZWQsIGl0IGlzIG5vIGxvbmdlciBzZWVuIGluIHRoZSByZWFkZGlyIHN0
cmVhbSwgYW5kIGEgZmlsZSB0aGF0IGlzDQo+PiBhZGRlZCB3aWxsIGJlIHNlZW4gdW5sZXNzIHRo
ZSByZWFkZGlyIGN1cnNvciBoYXMgYWxyZWFkeSBwYXNzZWQgaXQsIHNvDQo+PiB0aGVyZSBpcyB0
aGF0IHRvIGNvbnNpZGVyLg0KPiANCj4gSW4gdGhlIE5GUyBjYXNlIGl0IHdvdWxkIGJlIHVwIHRv
IHRoZSBzZXJ2ZXIgdG8gZGVjaWRlIGlmIGl0IHdhbnRzIHRvIHRocm93DQo+IG91dCB0aGUgaW4t
dXNlIGNvb2tpdmVyZiB3aXRoIEJBRENPT0tJRSwgb3IganVzdCBpbmNsdWRlIHRoZSBuZXcgZmls
ZSBhdA0KPiBzb21lIGxhdGVyIGVudHJ5IGNvb2tpZS4gIEkgdGhpbmsgaWYgdGhlIHNlcnZlciB3
YW50ZWQgdG8gaW5zZXJ0IHRoZSBlbnRyeSwNCj4gaXQgbXVzdCBpbnZhbGlkYXRlIGFsbCBjdXJy
ZW50bHkgaW4tdXNlIGNvb2tpZXZlcmYgdmFsdWVzLg0KDQpUaGUgd2hvbGUgcmVhZGRpciB2ZXJp
ZmllciBjcmFwIGlzIG5vdCBzdXBwb3J0ZWQgb3IgaW1wbGVtZW50ZWQgaW4gdGhlIExpbnV4IGNs
aWVudC4gU2VlIHRoZSBwbGV0aG9yYSBvZiBoaXN0b3JpY2FsIHRocmVhZHMgb24gdGhpcyB0b3Bp
YyBhbmQgd2h5IHlvdSBjYW5ub3Qgc3VwcG9ydCB0ZWxsZGlyKCkvc2Vla2RpcigpIHNlbWFudGlj
cyB3aXRoIGNvb2tpZXMgdGhhdCBjaGFuZ2Ugd2lsbHkgbmlsbHkuLi4gDQoNCj4gDQo+PiBIb3dl
dmVyIGNvcnJlY3RuZXNzIHdhcyBub3QgdGhlIG1haW4gaXNzdWUgaGVyZS4NCj4gDQo+IEkgdGhp
bmsgYnkgImhlcmUiIHlvdSBtZWFuIHRoZSBSRUFERElSUExVUyBvcHRpbWl6YXRpb24sIGFuZCBu
b3QgdGhlIFBPU0lYDQo+IHVuZGVmaW5lZCBiZWhhdmlvciwgaW4gd2hpY2ggY2FzZSBJIHRha2Ug
eW91IHRvIG1lYW4gdGhhdCB0aGUgbWFpbiBpc3N1ZSB3YXMNCj4gcGVyZm9ybWFuY2UuICBUaGF0
J3MgdGhlIGlzc3VlIGF0IGhhbmQgZm9yIHRoZSBjYXNlIEkgZGVzY3JpYmVkIGFzIHdlbGwuDQo+
IERyb3BwaW5nIHRoZSBtaWQtZGlyZWN0b3J5IHJldmFsaWRhdGlvbiBzaG91bGQga2VlcCB0aGUg
b3B0aW1pemF0aW9uIGZvcg0KPiB5b3VyIGNhc2UgYW5kIHJlc3RvcmUgdGhlIG9wdGltaXphdGlv
biBmb3IgdGhpcyBjYXNlLg0KPiANCj4gQWdhaW4sIGFwb2xvZ2llcyBmb3IgdGhlIG1lc3N5IG1h
aWwgdGhyZWFkLiAgOi8NCj4gDQo+IEJlbg0KDQo=

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: I can't get no readdir satisfaction
  2016-08-24 12:18     ` Trond Myklebust
  2016-08-24 13:15       ` Benjamin Coddington
@ 2016-08-24 13:56       ` J. Bruce Fields
  2016-08-24 14:02         ` Trond Myklebust
  1 sibling, 1 reply; 19+ messages in thread
From: J. Bruce Fields @ 2016-08-24 13:56 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Coddington Benjamin, List Linux NFS Mailing

On Wed, Aug 24, 2016 at 12:18:04PM +0000, Trond Myklebust wrote:
>=20
> > On Aug 23, 2016, at 17:21, Benjamin Coddington <bcodding@redhat.com> wr=
ote:
> >=20
> >=20
> >=20
> > On 23 Aug 2016, at 11:36, Trond Myklebust wrote:
> >=20
> >>> On Aug 23, 2016, at 11:09, Benjamin Coddington <bcodding@redhat.com> =
wrote:
> >>>=20
> >>> Hi linux-nfs,
> >>>=20
> >>> 311324ad1713 ("NFS: Be more aggressive in using readdirplus for 'ls -=
l'
> >>> situations") changed when nfs_readdir() decides to revalidate the
> >>> directory's mapping, which contains all the entries.  In addition to =
just
> >>> checking if the attribute cache has expired, it includes a check to s=
ee if
> >>> NFS_INO_INVALID_DATA is set on the directory.
> >>>=20
> >>> Well, customers that have directories with very many dentries and tha=
t same
> >>> directory's attributes are frequently updated are now grouchy that `l=
s -l`
> >>> takes so long since any update of the directory causes the mapping to=
 be
> >>> invalidated and we have to start over filling the directory's mapping.
> >>>=20
> >>> I actually haven't put real hard thought into it yet (because often f=
or me
> >>> that just wastes a lot of time), so I am doing the lazy thing by aski=
ng this
> >>> question:
> >>>=20
> >>> Can we go back to just the using the attribute cache timeout, or shou=
ld we
> >>> get all heuristical about readdir?
> >>>=20
> >>=20
> >> We are all heuristical at this point. How are the heuristics failing?
> >>=20
> >> The original problem those heuristics were designed to solve was that =
all
> >> the stat() calls took forever to complete, since they are all synchron=
ous;
> >> Tigran showed some very convincing numbers for a large directory where=
 the
> >> difference in performance was an order of magnitude improved by using
> >> readdirplus instead of readdir=E2=80=A6
> >=20
> > I'll try to present a better explanation.  While `ls -l` is walking thr=
ough
> > a directory repeatedly entering nfs_readdir(), a CREATE response send us
> > through nfs_post_op_update_inode_locked():
> >=20
> > 1531     if (S_ISDIR(inode->i_mode))
> > 1532         invalid |=3D NFS_INO_INVALID_DATA;
> > 1533     nfs_set_cache_invalid(inode, invalid);
> >=20
> > Now, the next entry into nfs_readdir() has us do nfs_revalidate_mapping=
(),
> > which will do nfs_invalidate_mapping() for the directory, and so we hav=
e to
> > start over with cookie 0 sending READDIRPLUS to re-fill the directory's
> > mapping to get back to where we are for the current nfs_readdir().
> >=20
> > This process repeats for every entry into nfs_readdir() if the directory
> > keeps getting updated, and it becomes more likely that it will be updat=
ed as
> > each pass takes longer and longer to re-fill the mapping as the current
> > nfs_readdir() invocation is further along.
> >=20
> > READDIRPLUS isn't the problem, the problem is invalidating the directory
> > mapping in the middle of a series of getdents() if we do a CREATE.  Als=
o, I
> > think a similar thing happens if the directory's mtime or ctime is upda=
ted -
> > but in that case we set NFS_INO_INVALID_DATA because the change_attr
> > updates.
> >=20
> > So, for directories with a large number of entries that updates often, =
it can
> > be very slow to list the directory.
> >=20
> > Why did 311324ad1713 change nfs_readdir from
> >=20
> > if (nfs_attribute_cache_expired(inode))
> >    nfs_revalidate_mapping(inode, file->f_mapping);
> > to
> >=20
> > if (nfs_attribute_cache_expired(inode) || nfsi->cache_validity & NFS_IN=
O_INVALID_DATA)
> >    nfs_revalidate_mapping(inode, file->f_mapping);
>=20
> As the commit message says, the whole purpose was to use READDIRPLUS as a=
 substitute for multiple GETATTR calls when the heuristic tells us that the=
 user is performing an =E2=80=98ls -l=E2=80=99 style of workload.
>=20
> >=20
> > .. and can we go back to the way it was before?
>=20
> Not without slowing down =E2=80=98ls -l=E2=80=99 on large directories.
>=20
> >=20
> > OK.. I understand why -- it is more correct since if we know the direct=
ory has
> > changed, we might as well fetch the change.  Otherwise, we might be cre=
ating
> > files and then wondering why they aren't listed.
> >=20
> > It might be nicer to not invalidate the mapping we're currently using f=
or
> > readdir, though.  Maybe there's a way to keep the mapping for the curre=
ntly
> > opened directory and invalidate it once it's closed.
>=20

>  POSIX requires that you revalidate on opendir() and rewinddir(), and
>  leaves behaviour w.r.t. file addition and removal after the call to
>  opendir()/rewinddir() undefined
>  (http://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html).

It is only undefined whether the added or removed entry is returned.
Other entries still need to returned exactly once.

In this case we're restarting the read of the directory from scratch--I
don't understand how that's possible while avoiding skipped or
duplicated entries.

Surely the only safe thing to do is to continue reading using the last
cookie returned from the server.

--b.

>  I believe most filesystems under Linux will ensure that if a file is
>  removed, it is no longer seen in the readdir stream, and a file that
>  is added will be seen unless the readdir cursor has already passed
>  it, so there is that to consider.

>=20
> However correctness was not the main issue here.
>=20
> N?????r??y????b?X??=C7=A7v?^?)=DE=BA{.n?+????{???"??^n?r???z?=1A??h?????&=
??=1E?G???h?=03(?=E9=9A=8E?=DD=A2j"??=1A?=1Bm??????z?=DE=96???f???h???~?m

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: I can't get no readdir satisfaction
  2016-08-24 13:56       ` J. Bruce Fields
@ 2016-08-24 14:02         ` Trond Myklebust
  2016-08-24 14:16           ` Benjamin Coddington
  2016-08-24 14:20           ` Fields Bruce James
  0 siblings, 2 replies; 19+ messages in thread
From: Trond Myklebust @ 2016-08-24 14:02 UTC (permalink / raw)
  To: Fields Bruce James; +Cc: Coddington Benjamin, List Linux NFS Mailing

DQo+IE9uIEF1ZyAyNCwgMjAxNiwgYXQgMDk6NTYsIEouIEJydWNlIEZpZWxkcyA8YmZpZWxkc0Bm
aWVsZHNlcy5vcmc+IHdyb3RlOg0KPiANCj4gT24gV2VkLCBBdWcgMjQsIDIwMTYgYXQgMTI6MTg6
MDRQTSArMDAwMCwgVHJvbmQgTXlrbGVidXN0IHdyb3RlOg0KPj4gDQo+Pj4gT24gQXVnIDIzLCAy
MDE2LCBhdCAxNzoyMSwgQmVuamFtaW4gQ29kZGluZ3RvbiA8YmNvZGRpbmdAcmVkaGF0LmNvbT4g
d3JvdGU6DQo+Pj4gDQo+Pj4gDQo+Pj4gDQo+Pj4gT24gMjMgQXVnIDIwMTYsIGF0IDExOjM2LCBU
cm9uZCBNeWtsZWJ1c3Qgd3JvdGU6DQo+Pj4gDQo+Pj4+PiBPbiBBdWcgMjMsIDIwMTYsIGF0IDEx
OjA5LCBCZW5qYW1pbiBDb2RkaW5ndG9uIDxiY29kZGluZ0ByZWRoYXQuY29tPiB3cm90ZToNCj4+
Pj4+IA0KPj4+Pj4gSGkgbGludXgtbmZzLA0KPj4+Pj4gDQo+Pj4+PiAzMTEzMjRhZDE3MTMgKCJO
RlM6IEJlIG1vcmUgYWdncmVzc2l2ZSBpbiB1c2luZyByZWFkZGlycGx1cyBmb3IgJ2xzIC1sJw0K
Pj4+Pj4gc2l0dWF0aW9ucyIpIGNoYW5nZWQgd2hlbiBuZnNfcmVhZGRpcigpIGRlY2lkZXMgdG8g
cmV2YWxpZGF0ZSB0aGUNCj4+Pj4+IGRpcmVjdG9yeSdzIG1hcHBpbmcsIHdoaWNoIGNvbnRhaW5z
IGFsbCB0aGUgZW50cmllcy4gIEluIGFkZGl0aW9uIHRvIGp1c3QNCj4+Pj4+IGNoZWNraW5nIGlm
IHRoZSBhdHRyaWJ1dGUgY2FjaGUgaGFzIGV4cGlyZWQsIGl0IGluY2x1ZGVzIGEgY2hlY2sgdG8g
c2VlIGlmDQo+Pj4+PiBORlNfSU5PX0lOVkFMSURfREFUQSBpcyBzZXQgb24gdGhlIGRpcmVjdG9y
eS4NCj4+Pj4+IA0KPj4+Pj4gV2VsbCwgY3VzdG9tZXJzIHRoYXQgaGF2ZSBkaXJlY3RvcmllcyB3
aXRoIHZlcnkgbWFueSBkZW50cmllcyBhbmQgdGhhdCBzYW1lDQo+Pj4+PiBkaXJlY3RvcnkncyBh
dHRyaWJ1dGVzIGFyZSBmcmVxdWVudGx5IHVwZGF0ZWQgYXJlIG5vdyBncm91Y2h5IHRoYXQgYGxz
IC1sYA0KPj4+Pj4gdGFrZXMgc28gbG9uZyBzaW5jZSBhbnkgdXBkYXRlIG9mIHRoZSBkaXJlY3Rv
cnkgY2F1c2VzIHRoZSBtYXBwaW5nIHRvIGJlDQo+Pj4+PiBpbnZhbGlkYXRlZCBhbmQgd2UgaGF2
ZSB0byBzdGFydCBvdmVyIGZpbGxpbmcgdGhlIGRpcmVjdG9yeSdzIG1hcHBpbmcuDQo+Pj4+PiAN
Cj4+Pj4+IEkgYWN0dWFsbHkgaGF2ZW4ndCBwdXQgcmVhbCBoYXJkIHRob3VnaHQgaW50byBpdCB5
ZXQgKGJlY2F1c2Ugb2Z0ZW4gZm9yIG1lDQo+Pj4+PiB0aGF0IGp1c3Qgd2FzdGVzIGEgbG90IG9m
IHRpbWUpLCBzbyBJIGFtIGRvaW5nIHRoZSBsYXp5IHRoaW5nIGJ5IGFza2luZyB0aGlzDQo+Pj4+
PiBxdWVzdGlvbjoNCj4+Pj4+IA0KPj4+Pj4gQ2FuIHdlIGdvIGJhY2sgdG8ganVzdCB0aGUgdXNp
bmcgdGhlIGF0dHJpYnV0ZSBjYWNoZSB0aW1lb3V0LCBvciBzaG91bGQgd2UNCj4+Pj4+IGdldCBh
bGwgaGV1cmlzdGljYWwgYWJvdXQgcmVhZGRpcj8NCj4+Pj4+IA0KPj4+PiANCj4+Pj4gV2UgYXJl
IGFsbCBoZXVyaXN0aWNhbCBhdCB0aGlzIHBvaW50LiBIb3cgYXJlIHRoZSBoZXVyaXN0aWNzIGZh
aWxpbmc/DQo+Pj4+IA0KPj4+PiBUaGUgb3JpZ2luYWwgcHJvYmxlbSB0aG9zZSBoZXVyaXN0aWNz
IHdlcmUgZGVzaWduZWQgdG8gc29sdmUgd2FzIHRoYXQgYWxsDQo+Pj4+IHRoZSBzdGF0KCkgY2Fs
bHMgdG9vayBmb3JldmVyIHRvIGNvbXBsZXRlLCBzaW5jZSB0aGV5IGFyZSBhbGwgc3luY2hyb25v
dXM7DQo+Pj4+IFRpZ3JhbiBzaG93ZWQgc29tZSB2ZXJ5IGNvbnZpbmNpbmcgbnVtYmVycyBmb3Ig
YSBsYXJnZSBkaXJlY3Rvcnkgd2hlcmUgdGhlDQo+Pj4+IGRpZmZlcmVuY2UgaW4gcGVyZm9ybWFu
Y2Ugd2FzIGFuIG9yZGVyIG9mIG1hZ25pdHVkZSBpbXByb3ZlZCBieSB1c2luZw0KPj4+PiByZWFk
ZGlycGx1cyBpbnN0ZWFkIG9mIHJlYWRkaXLigKYNCj4+PiANCj4+PiBJJ2xsIHRyeSB0byBwcmVz
ZW50IGEgYmV0dGVyIGV4cGxhbmF0aW9uLiAgV2hpbGUgYGxzIC1sYCBpcyB3YWxraW5nIHRocm91
Z2gNCj4+PiBhIGRpcmVjdG9yeSByZXBlYXRlZGx5IGVudGVyaW5nIG5mc19yZWFkZGlyKCksIGEg
Q1JFQVRFIHJlc3BvbnNlIHNlbmQgdXMNCj4+PiB0aHJvdWdoIG5mc19wb3N0X29wX3VwZGF0ZV9p
bm9kZV9sb2NrZWQoKToNCj4+PiANCj4+PiAxNTMxICAgICBpZiAoU19JU0RJUihpbm9kZS0+aV9t
b2RlKSkNCj4+PiAxNTMyICAgICAgICAgaW52YWxpZCB8PSBORlNfSU5PX0lOVkFMSURfREFUQTsN
Cj4+PiAxNTMzICAgICBuZnNfc2V0X2NhY2hlX2ludmFsaWQoaW5vZGUsIGludmFsaWQpOw0KPj4+
IA0KPj4+IE5vdywgdGhlIG5leHQgZW50cnkgaW50byBuZnNfcmVhZGRpcigpIGhhcyB1cyBkbyBu
ZnNfcmV2YWxpZGF0ZV9tYXBwaW5nKCksDQo+Pj4gd2hpY2ggd2lsbCBkbyBuZnNfaW52YWxpZGF0
ZV9tYXBwaW5nKCkgZm9yIHRoZSBkaXJlY3RvcnksIGFuZCBzbyB3ZSBoYXZlIHRvDQo+Pj4gc3Rh
cnQgb3ZlciB3aXRoIGNvb2tpZSAwIHNlbmRpbmcgUkVBRERJUlBMVVMgdG8gcmUtZmlsbCB0aGUg
ZGlyZWN0b3J5J3MNCj4+PiBtYXBwaW5nIHRvIGdldCBiYWNrIHRvIHdoZXJlIHdlIGFyZSBmb3Ig
dGhlIGN1cnJlbnQgbmZzX3JlYWRkaXIoKS4NCj4+PiANCj4+PiBUaGlzIHByb2Nlc3MgcmVwZWF0
cyBmb3IgZXZlcnkgZW50cnkgaW50byBuZnNfcmVhZGRpcigpIGlmIHRoZSBkaXJlY3RvcnkNCj4+
PiBrZWVwcyBnZXR0aW5nIHVwZGF0ZWQsIGFuZCBpdCBiZWNvbWVzIG1vcmUgbGlrZWx5IHRoYXQg
aXQgd2lsbCBiZSB1cGRhdGVkIGFzDQo+Pj4gZWFjaCBwYXNzIHRha2VzIGxvbmdlciBhbmQgbG9u
Z2VyIHRvIHJlLWZpbGwgdGhlIG1hcHBpbmcgYXMgdGhlIGN1cnJlbnQNCj4+PiBuZnNfcmVhZGRp
cigpIGludm9jYXRpb24gaXMgZnVydGhlciBhbG9uZy4NCj4+PiANCj4+PiBSRUFERElSUExVUyBp
c24ndCB0aGUgcHJvYmxlbSwgdGhlIHByb2JsZW0gaXMgaW52YWxpZGF0aW5nIHRoZSBkaXJlY3Rv
cnkNCj4+PiBtYXBwaW5nIGluIHRoZSBtaWRkbGUgb2YgYSBzZXJpZXMgb2YgZ2V0ZGVudHMoKSBp
ZiB3ZSBkbyBhIENSRUFURS4gIEFsc28sIEkNCj4+PiB0aGluayBhIHNpbWlsYXIgdGhpbmcgaGFw
cGVucyBpZiB0aGUgZGlyZWN0b3J5J3MgbXRpbWUgb3IgY3RpbWUgaXMgdXBkYXRlZCAtDQo+Pj4g
YnV0IGluIHRoYXQgY2FzZSB3ZSBzZXQgTkZTX0lOT19JTlZBTElEX0RBVEEgYmVjYXVzZSB0aGUg
Y2hhbmdlX2F0dHINCj4+PiB1cGRhdGVzLg0KPj4+IA0KPj4+IFNvLCBmb3IgZGlyZWN0b3JpZXMg
d2l0aCBhIGxhcmdlIG51bWJlciBvZiBlbnRyaWVzIHRoYXQgdXBkYXRlcyBvZnRlbiwgaXQgY2Fu
DQo+Pj4gYmUgdmVyeSBzbG93IHRvIGxpc3QgdGhlIGRpcmVjdG9yeS4NCj4+PiANCj4+PiBXaHkg
ZGlkIDMxMTMyNGFkMTcxMyBjaGFuZ2UgbmZzX3JlYWRkaXIgZnJvbQ0KPj4+IA0KPj4+IGlmIChu
ZnNfYXR0cmlidXRlX2NhY2hlX2V4cGlyZWQoaW5vZGUpKQ0KPj4+ICAgbmZzX3JldmFsaWRhdGVf
bWFwcGluZyhpbm9kZSwgZmlsZS0+Zl9tYXBwaW5nKTsNCj4+PiB0bw0KPj4+IA0KPj4+IGlmIChu
ZnNfYXR0cmlidXRlX2NhY2hlX2V4cGlyZWQoaW5vZGUpIHx8IG5mc2ktPmNhY2hlX3ZhbGlkaXR5
ICYgTkZTX0lOT19JTlZBTElEX0RBVEEpDQo+Pj4gICBuZnNfcmV2YWxpZGF0ZV9tYXBwaW5nKGlu
b2RlLCBmaWxlLT5mX21hcHBpbmcpOw0KPj4gDQo+PiBBcyB0aGUgY29tbWl0IG1lc3NhZ2Ugc2F5
cywgdGhlIHdob2xlIHB1cnBvc2Ugd2FzIHRvIHVzZSBSRUFERElSUExVUyBhcyBhIHN1YnN0aXR1
dGUgZm9yIG11bHRpcGxlIEdFVEFUVFIgY2FsbHMgd2hlbiB0aGUgaGV1cmlzdGljIHRlbGxzIHVz
IHRoYXQgdGhlIHVzZXIgaXMgcGVyZm9ybWluZyBhbiDigJhscyAtbOKAmSBzdHlsZSBvZiB3b3Jr
bG9hZC4NCj4+IA0KPj4+IA0KPj4+IC4uIGFuZCBjYW4gd2UgZ28gYmFjayB0byB0aGUgd2F5IGl0
IHdhcyBiZWZvcmU/DQo+PiANCj4+IE5vdCB3aXRob3V0IHNsb3dpbmcgZG93biDigJhscyAtbOKA
mSBvbiBsYXJnZSBkaXJlY3Rvcmllcy4NCj4+IA0KPj4+IA0KPj4+IE9LLi4gSSB1bmRlcnN0YW5k
IHdoeSAtLSBpdCBpcyBtb3JlIGNvcnJlY3Qgc2luY2UgaWYgd2Uga25vdyB0aGUgZGlyZWN0b3J5
IGhhcw0KPj4+IGNoYW5nZWQsIHdlIG1pZ2h0IGFzIHdlbGwgZmV0Y2ggdGhlIGNoYW5nZS4gIE90
aGVyd2lzZSwgd2UgbWlnaHQgYmUgY3JlYXRpbmcNCj4+PiBmaWxlcyBhbmQgdGhlbiB3b25kZXJp
bmcgd2h5IHRoZXkgYXJlbid0IGxpc3RlZC4NCj4+PiANCj4+PiBJdCBtaWdodCBiZSBuaWNlciB0
byBub3QgaW52YWxpZGF0ZSB0aGUgbWFwcGluZyB3ZSdyZSBjdXJyZW50bHkgdXNpbmcgZm9yDQo+
Pj4gcmVhZGRpciwgdGhvdWdoLiAgTWF5YmUgdGhlcmUncyBhIHdheSB0byBrZWVwIHRoZSBtYXBw
aW5nIGZvciB0aGUgY3VycmVudGx5DQo+Pj4gb3BlbmVkIGRpcmVjdG9yeSBhbmQgaW52YWxpZGF0
ZSBpdCBvbmNlIGl0J3MgY2xvc2VkLg0KPj4gDQo+IA0KPj4gUE9TSVggcmVxdWlyZXMgdGhhdCB5
b3UgcmV2YWxpZGF0ZSBvbiBvcGVuZGlyKCkgYW5kIHJld2luZGRpcigpLCBhbmQNCj4+IGxlYXZl
cyBiZWhhdmlvdXIgdy5yLnQuIGZpbGUgYWRkaXRpb24gYW5kIHJlbW92YWwgYWZ0ZXIgdGhlIGNh
bGwgdG8NCj4+IG9wZW5kaXIoKS9yZXdpbmRkaXIoKSB1bmRlZmluZWQNCj4+IChodHRwOi8vcHVi
cy5vcGVuZ3JvdXAub3JnL29ubGluZXB1YnMvOTY5OTkxOTc5OS9mdW5jdGlvbnMvcmVhZGRpci5o
dG1sKS4NCj4gDQo+IEl0IGlzIG9ubHkgdW5kZWZpbmVkIHdoZXRoZXIgdGhlIGFkZGVkIG9yIHJl
bW92ZWQgZW50cnkgaXMgcmV0dXJuZWQuDQo+IE90aGVyIGVudHJpZXMgc3RpbGwgbmVlZCB0byBy
ZXR1cm5lZCBleGFjdGx5IG9uY2UuDQo+IA0KPiBJbiB0aGlzIGNhc2Ugd2UncmUgcmVzdGFydGlu
ZyB0aGUgcmVhZCBvZiB0aGUgZGlyZWN0b3J5IGZyb20gc2NyYXRjaC0tSQ0KPiBkb24ndCB1bmRl
cnN0YW5kIGhvdyB0aGF0J3MgcG9zc2libGUgd2hpbGUgYXZvaWRpbmcgc2tpcHBlZCBvcg0KPiBk
dXBsaWNhdGVkIGVudHJpZXMuDQo+IA0KPiBTdXJlbHkgdGhlIG9ubHkgc2FmZSB0aGluZyB0byBk
byBpcyB0byBjb250aW51ZSByZWFkaW5nIHVzaW5nIHRoZSBsYXN0DQo+IGNvb2tpZSByZXR1cm5l
ZCBmcm9tIHRoZSBzZXJ2ZXIuDQoNCldoeT8gVGhlIGNsaWVudCBzaG91bGQgYmUgYWJsZSB0byBy
ZXN0YXJ0IHVzaW5nIGFueSBjb29raWUgYXQgYW55IHRpbWUsIGFuZCB3ZSByZWx5IG9uIHRoZSBj
b29raWVzIGJlaW5nIHVuaXF1ZSB0byBlYWNoIGVudHJ5LiBJZiB5b3Ugd2FudCBtb3JlIHJlbGF4
ZWQgY29va2llIHNlbWFudGljcyB0aGVuIGJlIHByZXBhcmVkIHRvIGhhdmUgdG8gc2V0IHVwIGEg
c3RhdGVmdWwgTkZTIHJlYWRkaXIgcHJvdG9jb2wuDQoNClRyb25k

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: I can't get no readdir satisfaction
  2016-08-24 14:02         ` Trond Myklebust
@ 2016-08-24 14:16           ` Benjamin Coddington
  2016-08-24 14:19             ` Trond Myklebust
  2016-08-24 14:20           ` Fields Bruce James
  1 sibling, 1 reply; 19+ messages in thread
From: Benjamin Coddington @ 2016-08-24 14:16 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Fields Bruce James, List Linux NFS Mailing

On 24 Aug 2016, at 10:02, Trond Myklebust wrote:

>> On Aug 24, 2016, at 09:56, J. Bruce Fields <bfields@fieldses.org> 
>> wrote:
>>
>> On Wed, Aug 24, 2016 at 12:18:04PM +0000, Trond Myklebust wrote:
>>>
>>>> On Aug 23, 2016, at 17:21, Benjamin Coddington 
>>>> <bcodding@redhat.com> wrote:
>>>>
>>>>
>>>>
>>>> On 23 Aug 2016, at 11:36, Trond Myklebust wrote:
>>>>
>>>>>> On Aug 23, 2016, at 11:09, Benjamin Coddington 
>>>>>> <bcodding@redhat.com> wrote:
>>>>>>
>>>>>> Hi linux-nfs,
>>>>>>
>>>>>> 311324ad1713 ("NFS: Be more aggressive in using readdirplus for 
>>>>>> 'ls -l'
>>>>>> situations") changed when nfs_readdir() decides to revalidate the
>>>>>> directory's mapping, which contains all the entries.  In addition 
>>>>>> to just
>>>>>> checking if the attribute cache has expired, it includes a check 
>>>>>> to see if
>>>>>> NFS_INO_INVALID_DATA is set on the directory.
>>>>>>
>>>>>> Well, customers that have directories with very many dentries and 
>>>>>> that same
>>>>>> directory's attributes are frequently updated are now grouchy 
>>>>>> that `ls -l`
>>>>>> takes so long since any update of the directory causes the 
>>>>>> mapping to be
>>>>>> invalidated and we have to start over filling the directory's 
>>>>>> mapping.
>>>>>>
>>>>>> I actually haven't put real hard thought into it yet (because 
>>>>>> often for me
>>>>>> that just wastes a lot of time), so I am doing the lazy thing by 
>>>>>> asking this
>>>>>> question:
>>>>>>
>>>>>> Can we go back to just the using the attribute cache timeout, or 
>>>>>> should we
>>>>>> get all heuristical about readdir?
>>>>>>
>>>>>
>>>>> We are all heuristical at this point. How are the heuristics 
>>>>> failing?
>>>>>
>>>>> The original problem those heuristics were designed to solve was 
>>>>> that all
>>>>> the stat() calls took forever to complete, since they are all 
>>>>> synchronous;
>>>>> Tigran showed some very convincing numbers for a large directory 
>>>>> where the
>>>>> difference in performance was an order of magnitude improved by 
>>>>> using
>>>>> readdirplus instead of readdir…
>>>>
>>>> I'll try to present a better explanation.  While `ls -l` is walking 
>>>> through
>>>> a directory repeatedly entering nfs_readdir(), a CREATE response 
>>>> send us
>>>> through nfs_post_op_update_inode_locked():
>>>>
>>>> 1531     if (S_ISDIR(inode->i_mode))
>>>> 1532         invalid |= NFS_INO_INVALID_DATA;
>>>> 1533     nfs_set_cache_invalid(inode, invalid);
>>>>
>>>> Now, the next entry into nfs_readdir() has us do 
>>>> nfs_revalidate_mapping(),
>>>> which will do nfs_invalidate_mapping() for the directory, and so we 
>>>> have to
>>>> start over with cookie 0 sending READDIRPLUS to re-fill the 
>>>> directory's
>>>> mapping to get back to where we are for the current nfs_readdir().
>>>>
>>>> This process repeats for every entry into nfs_readdir() if the 
>>>> directory
>>>> keeps getting updated, and it becomes more likely that it will be 
>>>> updated as
>>>> each pass takes longer and longer to re-fill the mapping as the 
>>>> current
>>>> nfs_readdir() invocation is further along.
>>>>
>>>> READDIRPLUS isn't the problem, the problem is invalidating the 
>>>> directory
>>>> mapping in the middle of a series of getdents() if we do a CREATE.  
>>>> Also, I
>>>> think a similar thing happens if the directory's mtime or ctime is 
>>>> updated -
>>>> but in that case we set NFS_INO_INVALID_DATA because the 
>>>> change_attr
>>>> updates.
>>>>
>>>> So, for directories with a large number of entries that updates 
>>>> often, it can
>>>> be very slow to list the directory.
>>>>
>>>> Why did 311324ad1713 change nfs_readdir from
>>>>
>>>> if (nfs_attribute_cache_expired(inode))
>>>>   nfs_revalidate_mapping(inode, file->f_mapping);
>>>> to
>>>>
>>>> if (nfs_attribute_cache_expired(inode) || nfsi->cache_validity & 
>>>> NFS_INO_INVALID_DATA)
>>>>   nfs_revalidate_mapping(inode, file->f_mapping);
>>>
>>> As the commit message says, the whole purpose was to use READDIRPLUS 
>>> as a substitute for multiple GETATTR calls when the heuristic tells 
>>> us that the user is performing an ‘ls -l’ style of workload.
>>>
>>>>
>>>> .. and can we go back to the way it was before?
>>>
>>> Not without slowing down ‘ls -l’ on large directories.
>>>
>>>>
>>>> OK.. I understand why -- it is more correct since if we know the 
>>>> directory has
>>>> changed, we might as well fetch the change.  Otherwise, we might be 
>>>> creating
>>>> files and then wondering why they aren't listed.
>>>>
>>>> It might be nicer to not invalidate the mapping we're currently 
>>>> using for
>>>> readdir, though.  Maybe there's a way to keep the mapping for the 
>>>> currently
>>>> opened directory and invalidate it once it's closed.
>>>
>>
>>> POSIX requires that you revalidate on opendir() and rewinddir(), and
>>> leaves behaviour w.r.t. file addition and removal after the call to
>>> opendir()/rewinddir() undefined
>>> (http://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html).
>>
>> It is only undefined whether the added or removed entry is returned.
>> Other entries still need to returned exactly once.
>>
>> In this case we're restarting the read of the directory from 
>> scratch--I
>> don't understand how that's possible while avoiding skipped or
>> duplicated entries.
>>
>> Surely the only safe thing to do is to continue reading using the 
>> last
>> cookie returned from the server.
>
> Why? The client should be able to restart using any cookie at any 
> time,
> and we rely on the cookies being unique to each entry. If you want 
> more
> relaxed cookie semantics then be prepared to have to set up a stateful 
> NFS
> readdir protocol.

It looks to me as if I send in a non-zero position to nfs_readdir(), but 
the mapping
has been invalidated, then the client starts over at cookie 0 sending
READDIRs to the server and reading the directory entries into the page 
cache
in order to figure out which position maps to which entry/cookie.

It seems possible to get duplicated or skipped entries.

Ben



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: I can't get no readdir satisfaction
  2016-08-24 14:16           ` Benjamin Coddington
@ 2016-08-24 14:19             ` Trond Myklebust
  2016-08-24 15:18               ` Benjamin Coddington
  0 siblings, 1 reply; 19+ messages in thread
From: Trond Myklebust @ 2016-08-24 14:19 UTC (permalink / raw)
  To: Coddington Benjamin; +Cc: Fields Bruce James, List Linux NFS Mailing

DQo+IE9uIEF1ZyAyNCwgMjAxNiwgYXQgMTA6MTYsIEJlbmphbWluIENvZGRpbmd0b24gPGJjb2Rk
aW5nQHJlZGhhdC5jb20+IHdyb3RlOg0KPiANCj4gT24gMjQgQXVnIDIwMTYsIGF0IDEwOjAyLCBU
cm9uZCBNeWtsZWJ1c3Qgd3JvdGU6DQo+IA0KPj4+IE9uIEF1ZyAyNCwgMjAxNiwgYXQgMDk6NTYs
IEouIEJydWNlIEZpZWxkcyA8YmZpZWxkc0BmaWVsZHNlcy5vcmc+IHdyb3RlOg0KPj4+IA0KPj4+
IE9uIFdlZCwgQXVnIDI0LCAyMDE2IGF0IDEyOjE4OjA0UE0gKzAwMDAsIFRyb25kIE15a2xlYnVz
dCB3cm90ZToNCj4+Pj4gDQo+Pj4+PiBPbiBBdWcgMjMsIDIwMTYsIGF0IDE3OjIxLCBCZW5qYW1p
biBDb2RkaW5ndG9uIDxiY29kZGluZ0ByZWRoYXQuY29tPiB3cm90ZToNCj4+Pj4+IA0KPj4+Pj4g
DQo+Pj4+PiANCj4+Pj4+IE9uIDIzIEF1ZyAyMDE2LCBhdCAxMTozNiwgVHJvbmQgTXlrbGVidXN0
IHdyb3RlOg0KPj4+Pj4gDQo+Pj4+Pj4+IE9uIEF1ZyAyMywgMjAxNiwgYXQgMTE6MDksIEJlbmph
bWluIENvZGRpbmd0b24gPGJjb2RkaW5nQHJlZGhhdC5jb20+IHdyb3RlOg0KPj4+Pj4+PiANCj4+
Pj4+Pj4gSGkgbGludXgtbmZzLA0KPj4+Pj4+PiANCj4+Pj4+Pj4gMzExMzI0YWQxNzEzICgiTkZT
OiBCZSBtb3JlIGFnZ3Jlc3NpdmUgaW4gdXNpbmcgcmVhZGRpcnBsdXMgZm9yICdscyAtbCcNCj4+
Pj4+Pj4gc2l0dWF0aW9ucyIpIGNoYW5nZWQgd2hlbiBuZnNfcmVhZGRpcigpIGRlY2lkZXMgdG8g
cmV2YWxpZGF0ZSB0aGUNCj4+Pj4+Pj4gZGlyZWN0b3J5J3MgbWFwcGluZywgd2hpY2ggY29udGFp
bnMgYWxsIHRoZSBlbnRyaWVzLiAgSW4gYWRkaXRpb24gdG8ganVzdA0KPj4+Pj4+PiBjaGVja2lu
ZyBpZiB0aGUgYXR0cmlidXRlIGNhY2hlIGhhcyBleHBpcmVkLCBpdCBpbmNsdWRlcyBhIGNoZWNr
IHRvIHNlZSBpZg0KPj4+Pj4+PiBORlNfSU5PX0lOVkFMSURfREFUQSBpcyBzZXQgb24gdGhlIGRp
cmVjdG9yeS4NCj4+Pj4+Pj4gDQo+Pj4+Pj4+IFdlbGwsIGN1c3RvbWVycyB0aGF0IGhhdmUgZGly
ZWN0b3JpZXMgd2l0aCB2ZXJ5IG1hbnkgZGVudHJpZXMgYW5kIHRoYXQgc2FtZQ0KPj4+Pj4+PiBk
aXJlY3RvcnkncyBhdHRyaWJ1dGVzIGFyZSBmcmVxdWVudGx5IHVwZGF0ZWQgYXJlIG5vdyBncm91
Y2h5IHRoYXQgYGxzIC1sYA0KPj4+Pj4+PiB0YWtlcyBzbyBsb25nIHNpbmNlIGFueSB1cGRhdGUg
b2YgdGhlIGRpcmVjdG9yeSBjYXVzZXMgdGhlIG1hcHBpbmcgdG8gYmUNCj4+Pj4+Pj4gaW52YWxp
ZGF0ZWQgYW5kIHdlIGhhdmUgdG8gc3RhcnQgb3ZlciBmaWxsaW5nIHRoZSBkaXJlY3RvcnkncyBt
YXBwaW5nLg0KPj4+Pj4+PiANCj4+Pj4+Pj4gSSBhY3R1YWxseSBoYXZlbid0IHB1dCByZWFsIGhh
cmQgdGhvdWdodCBpbnRvIGl0IHlldCAoYmVjYXVzZSBvZnRlbiBmb3IgbWUNCj4+Pj4+Pj4gdGhh
dCBqdXN0IHdhc3RlcyBhIGxvdCBvZiB0aW1lKSwgc28gSSBhbSBkb2luZyB0aGUgbGF6eSB0aGlu
ZyBieSBhc2tpbmcgdGhpcw0KPj4+Pj4+PiBxdWVzdGlvbjoNCj4+Pj4+Pj4gDQo+Pj4+Pj4+IENh
biB3ZSBnbyBiYWNrIHRvIGp1c3QgdGhlIHVzaW5nIHRoZSBhdHRyaWJ1dGUgY2FjaGUgdGltZW91
dCwgb3Igc2hvdWxkIHdlDQo+Pj4+Pj4+IGdldCBhbGwgaGV1cmlzdGljYWwgYWJvdXQgcmVhZGRp
cj8NCj4+Pj4+Pj4gDQo+Pj4+Pj4gDQo+Pj4+Pj4gV2UgYXJlIGFsbCBoZXVyaXN0aWNhbCBhdCB0
aGlzIHBvaW50LiBIb3cgYXJlIHRoZSBoZXVyaXN0aWNzIGZhaWxpbmc/DQo+Pj4+Pj4gDQo+Pj4+
Pj4gVGhlIG9yaWdpbmFsIHByb2JsZW0gdGhvc2UgaGV1cmlzdGljcyB3ZXJlIGRlc2lnbmVkIHRv
IHNvbHZlIHdhcyB0aGF0IGFsbA0KPj4+Pj4+IHRoZSBzdGF0KCkgY2FsbHMgdG9vayBmb3JldmVy
IHRvIGNvbXBsZXRlLCBzaW5jZSB0aGV5IGFyZSBhbGwgc3luY2hyb25vdXM7DQo+Pj4+Pj4gVGln
cmFuIHNob3dlZCBzb21lIHZlcnkgY29udmluY2luZyBudW1iZXJzIGZvciBhIGxhcmdlIGRpcmVj
dG9yeSB3aGVyZSB0aGUNCj4+Pj4+PiBkaWZmZXJlbmNlIGluIHBlcmZvcm1hbmNlIHdhcyBhbiBv
cmRlciBvZiBtYWduaXR1ZGUgaW1wcm92ZWQgYnkgdXNpbmcNCj4+Pj4+PiByZWFkZGlycGx1cyBp
bnN0ZWFkIG9mIHJlYWRkaXLigKYNCj4+Pj4+IA0KPj4+Pj4gSSdsbCB0cnkgdG8gcHJlc2VudCBh
IGJldHRlciBleHBsYW5hdGlvbi4gIFdoaWxlIGBscyAtbGAgaXMgd2Fsa2luZyB0aHJvdWdoDQo+
Pj4+PiBhIGRpcmVjdG9yeSByZXBlYXRlZGx5IGVudGVyaW5nIG5mc19yZWFkZGlyKCksIGEgQ1JF
QVRFIHJlc3BvbnNlIHNlbmQgdXMNCj4+Pj4+IHRocm91Z2ggbmZzX3Bvc3Rfb3BfdXBkYXRlX2lu
b2RlX2xvY2tlZCgpOg0KPj4+Pj4gDQo+Pj4+PiAxNTMxICAgICBpZiAoU19JU0RJUihpbm9kZS0+
aV9tb2RlKSkNCj4+Pj4+IDE1MzIgICAgICAgICBpbnZhbGlkIHw9IE5GU19JTk9fSU5WQUxJRF9E
QVRBOw0KPj4+Pj4gMTUzMyAgICAgbmZzX3NldF9jYWNoZV9pbnZhbGlkKGlub2RlLCBpbnZhbGlk
KTsNCj4+Pj4+IA0KPj4+Pj4gTm93LCB0aGUgbmV4dCBlbnRyeSBpbnRvIG5mc19yZWFkZGlyKCkg
aGFzIHVzIGRvIG5mc19yZXZhbGlkYXRlX21hcHBpbmcoKSwNCj4+Pj4+IHdoaWNoIHdpbGwgZG8g
bmZzX2ludmFsaWRhdGVfbWFwcGluZygpIGZvciB0aGUgZGlyZWN0b3J5LCBhbmQgc28gd2UgaGF2
ZSB0bw0KPj4+Pj4gc3RhcnQgb3ZlciB3aXRoIGNvb2tpZSAwIHNlbmRpbmcgUkVBRERJUlBMVVMg
dG8gcmUtZmlsbCB0aGUgZGlyZWN0b3J5J3MNCj4+Pj4+IG1hcHBpbmcgdG8gZ2V0IGJhY2sgdG8g
d2hlcmUgd2UgYXJlIGZvciB0aGUgY3VycmVudCBuZnNfcmVhZGRpcigpLg0KPj4+Pj4gDQo+Pj4+
PiBUaGlzIHByb2Nlc3MgcmVwZWF0cyBmb3IgZXZlcnkgZW50cnkgaW50byBuZnNfcmVhZGRpcigp
IGlmIHRoZSBkaXJlY3RvcnkNCj4+Pj4+IGtlZXBzIGdldHRpbmcgdXBkYXRlZCwgYW5kIGl0IGJl
Y29tZXMgbW9yZSBsaWtlbHkgdGhhdCBpdCB3aWxsIGJlIHVwZGF0ZWQgYXMNCj4+Pj4+IGVhY2gg
cGFzcyB0YWtlcyBsb25nZXIgYW5kIGxvbmdlciB0byByZS1maWxsIHRoZSBtYXBwaW5nIGFzIHRo
ZSBjdXJyZW50DQo+Pj4+PiBuZnNfcmVhZGRpcigpIGludm9jYXRpb24gaXMgZnVydGhlciBhbG9u
Zy4NCj4+Pj4+IA0KPj4+Pj4gUkVBRERJUlBMVVMgaXNuJ3QgdGhlIHByb2JsZW0sIHRoZSBwcm9i
bGVtIGlzIGludmFsaWRhdGluZyB0aGUgZGlyZWN0b3J5DQo+Pj4+PiBtYXBwaW5nIGluIHRoZSBt
aWRkbGUgb2YgYSBzZXJpZXMgb2YgZ2V0ZGVudHMoKSBpZiB3ZSBkbyBhIENSRUFURS4gIEFsc28s
IEkNCj4+Pj4+IHRoaW5rIGEgc2ltaWxhciB0aGluZyBoYXBwZW5zIGlmIHRoZSBkaXJlY3Rvcnkn
cyBtdGltZSBvciBjdGltZSBpcyB1cGRhdGVkIC0NCj4+Pj4+IGJ1dCBpbiB0aGF0IGNhc2Ugd2Ug
c2V0IE5GU19JTk9fSU5WQUxJRF9EQVRBIGJlY2F1c2UgdGhlIGNoYW5nZV9hdHRyDQo+Pj4+PiB1
cGRhdGVzLg0KPj4+Pj4gDQo+Pj4+PiBTbywgZm9yIGRpcmVjdG9yaWVzIHdpdGggYSBsYXJnZSBu
dW1iZXIgb2YgZW50cmllcyB0aGF0IHVwZGF0ZXMgb2Z0ZW4sIGl0IGNhbg0KPj4+Pj4gYmUgdmVy
eSBzbG93IHRvIGxpc3QgdGhlIGRpcmVjdG9yeS4NCj4+Pj4+IA0KPj4+Pj4gV2h5IGRpZCAzMTEz
MjRhZDE3MTMgY2hhbmdlIG5mc19yZWFkZGlyIGZyb20NCj4+Pj4+IA0KPj4+Pj4gaWYgKG5mc19h
dHRyaWJ1dGVfY2FjaGVfZXhwaXJlZChpbm9kZSkpDQo+Pj4+PiAgbmZzX3JldmFsaWRhdGVfbWFw
cGluZyhpbm9kZSwgZmlsZS0+Zl9tYXBwaW5nKTsNCj4+Pj4+IHRvDQo+Pj4+PiANCj4+Pj4+IGlm
IChuZnNfYXR0cmlidXRlX2NhY2hlX2V4cGlyZWQoaW5vZGUpIHx8IG5mc2ktPmNhY2hlX3ZhbGlk
aXR5ICYgTkZTX0lOT19JTlZBTElEX0RBVEEpDQo+Pj4+PiAgbmZzX3JldmFsaWRhdGVfbWFwcGlu
Zyhpbm9kZSwgZmlsZS0+Zl9tYXBwaW5nKTsNCj4+Pj4gDQo+Pj4+IEFzIHRoZSBjb21taXQgbWVz
c2FnZSBzYXlzLCB0aGUgd2hvbGUgcHVycG9zZSB3YXMgdG8gdXNlIFJFQURESVJQTFVTIGFzIGEg
c3Vic3RpdHV0ZSBmb3IgbXVsdGlwbGUgR0VUQVRUUiBjYWxscyB3aGVuIHRoZSBoZXVyaXN0aWMg
dGVsbHMgdXMgdGhhdCB0aGUgdXNlciBpcyBwZXJmb3JtaW5nIGFuIOKAmGxzIC1s4oCZIHN0eWxl
IG9mIHdvcmtsb2FkLg0KPj4+PiANCj4+Pj4+IA0KPj4+Pj4gLi4gYW5kIGNhbiB3ZSBnbyBiYWNr
IHRvIHRoZSB3YXkgaXQgd2FzIGJlZm9yZT8NCj4+Pj4gDQo+Pj4+IE5vdCB3aXRob3V0IHNsb3dp
bmcgZG93biDigJhscyAtbOKAmSBvbiBsYXJnZSBkaXJlY3Rvcmllcy4NCj4+Pj4gDQo+Pj4+PiAN
Cj4+Pj4+IE9LLi4gSSB1bmRlcnN0YW5kIHdoeSAtLSBpdCBpcyBtb3JlIGNvcnJlY3Qgc2luY2Ug
aWYgd2Uga25vdyB0aGUgZGlyZWN0b3J5IGhhcw0KPj4+Pj4gY2hhbmdlZCwgd2UgbWlnaHQgYXMg
d2VsbCBmZXRjaCB0aGUgY2hhbmdlLiAgT3RoZXJ3aXNlLCB3ZSBtaWdodCBiZSBjcmVhdGluZw0K
Pj4+Pj4gZmlsZXMgYW5kIHRoZW4gd29uZGVyaW5nIHdoeSB0aGV5IGFyZW4ndCBsaXN0ZWQuDQo+
Pj4+PiANCj4+Pj4+IEl0IG1pZ2h0IGJlIG5pY2VyIHRvIG5vdCBpbnZhbGlkYXRlIHRoZSBtYXBw
aW5nIHdlJ3JlIGN1cnJlbnRseSB1c2luZyBmb3INCj4+Pj4+IHJlYWRkaXIsIHRob3VnaC4gIE1h
eWJlIHRoZXJlJ3MgYSB3YXkgdG8ga2VlcCB0aGUgbWFwcGluZyBmb3IgdGhlIGN1cnJlbnRseQ0K
Pj4+Pj4gb3BlbmVkIGRpcmVjdG9yeSBhbmQgaW52YWxpZGF0ZSBpdCBvbmNlIGl0J3MgY2xvc2Vk
Lg0KPj4+PiANCj4+PiANCj4+Pj4gUE9TSVggcmVxdWlyZXMgdGhhdCB5b3UgcmV2YWxpZGF0ZSBv
biBvcGVuZGlyKCkgYW5kIHJld2luZGRpcigpLCBhbmQNCj4+Pj4gbGVhdmVzIGJlaGF2aW91ciB3
LnIudC4gZmlsZSBhZGRpdGlvbiBhbmQgcmVtb3ZhbCBhZnRlciB0aGUgY2FsbCB0bw0KPj4+PiBv
cGVuZGlyKCkvcmV3aW5kZGlyKCkgdW5kZWZpbmVkDQo+Pj4+IChodHRwOi8vcHVicy5vcGVuZ3Jv
dXAub3JnL29ubGluZXB1YnMvOTY5OTkxOTc5OS9mdW5jdGlvbnMvcmVhZGRpci5odG1sKS4NCj4+
PiANCj4+PiBJdCBpcyBvbmx5IHVuZGVmaW5lZCB3aGV0aGVyIHRoZSBhZGRlZCBvciByZW1vdmVk
IGVudHJ5IGlzIHJldHVybmVkLg0KPj4+IE90aGVyIGVudHJpZXMgc3RpbGwgbmVlZCB0byByZXR1
cm5lZCBleGFjdGx5IG9uY2UuDQo+Pj4gDQo+Pj4gSW4gdGhpcyBjYXNlIHdlJ3JlIHJlc3RhcnRp
bmcgdGhlIHJlYWQgb2YgdGhlIGRpcmVjdG9yeSBmcm9tIHNjcmF0Y2gtLUkNCj4+PiBkb24ndCB1
bmRlcnN0YW5kIGhvdyB0aGF0J3MgcG9zc2libGUgd2hpbGUgYXZvaWRpbmcgc2tpcHBlZCBvcg0K
Pj4+IGR1cGxpY2F0ZWQgZW50cmllcy4NCj4+PiANCj4+PiBTdXJlbHkgdGhlIG9ubHkgc2FmZSB0
aGluZyB0byBkbyBpcyB0byBjb250aW51ZSByZWFkaW5nIHVzaW5nIHRoZSBsYXN0DQo+Pj4gY29v
a2llIHJldHVybmVkIGZyb20gdGhlIHNlcnZlci4NCj4+IA0KPj4gV2h5PyBUaGUgY2xpZW50IHNo
b3VsZCBiZSBhYmxlIHRvIHJlc3RhcnQgdXNpbmcgYW55IGNvb2tpZSBhdCBhbnkgdGltZSwNCj4+
IGFuZCB3ZSByZWx5IG9uIHRoZSBjb29raWVzIGJlaW5nIHVuaXF1ZSB0byBlYWNoIGVudHJ5LiBJ
ZiB5b3Ugd2FudCBtb3JlDQo+PiByZWxheGVkIGNvb2tpZSBzZW1hbnRpY3MgdGhlbiBiZSBwcmVw
YXJlZCB0byBoYXZlIHRvIHNldCB1cCBhIHN0YXRlZnVsIE5GUw0KPj4gcmVhZGRpciBwcm90b2Nv
bC4NCj4gDQo+IEl0IGxvb2tzIHRvIG1lIGFzIGlmIEkgc2VuZCBpbiBhIG5vbi16ZXJvIHBvc2l0
aW9uIHRvIG5mc19yZWFkZGlyKCksIGJ1dCB0aGUgbWFwcGluZw0KPiBoYXMgYmVlbiBpbnZhbGlk
YXRlZCwgdGhlbiB0aGUgY2xpZW50IHN0YXJ0cyBvdmVyIGF0IGNvb2tpZSAwIHNlbmRpbmcNCj4g
UkVBRERJUnMgdG8gdGhlIHNlcnZlciBhbmQgcmVhZGluZyB0aGUgZGlyZWN0b3J5IGVudHJpZXMg
aW50byB0aGUgcGFnZSBjYWNoZQ0KPiBpbiBvcmRlciB0byBmaWd1cmUgb3V0IHdoaWNoIHBvc2l0
aW9uIG1hcHMgdG8gd2hpY2ggZW50cnkvY29va2llLg0KDQpZZXMuDQoNCj4gSXQgc2VlbXMgcG9z
c2libGUgdG8gZ2V0IGR1cGxpY2F0ZWQgb3Igc2tpcHBlZCBlbnRyaWVzLg0KDQpUaGVyZSBpcyBh
IGxvZ2ljYWwgc3RlcCBtaXNzaW5nIGhlcmUuIEhvdyBkbyB5b3UgZ2V0IGZyb20g4oCcd2UgY2Fu
IGVuZCB1cCByZS1maWxsaW5nIHRoZSBjYWNoZSBmcm9tIHNjcmF0Y2jigJ0gdG8g4oCcdGhlcmUg
Y2FuIGJlIGR1cGxpY2F0ZSBvciBza2lwcGVkIGVudHJpZXPigJ0/DQoNCg0K

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: I can't get no readdir satisfaction
  2016-08-24 14:02         ` Trond Myklebust
  2016-08-24 14:16           ` Benjamin Coddington
@ 2016-08-24 14:20           ` Fields Bruce James
  2016-08-24 14:26             ` Trond Myklebust
  1 sibling, 1 reply; 19+ messages in thread
From: Fields Bruce James @ 2016-08-24 14:20 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Coddington Benjamin, List Linux NFS Mailing

On Wed, Aug 24, 2016 at 02:02:04PM +0000, Trond Myklebust wrote:
> 
> > On Aug 24, 2016, at 09:56, J. Bruce Fields <bfields@fieldses.org> wrote:
> > 
> > On Wed, Aug 24, 2016 at 12:18:04PM +0000, Trond Myklebust wrote:
> >> 
> >>> On Aug 23, 2016, at 17:21, Benjamin Coddington <bcodding@redhat.com> wrote:
> >>> 
> >>> 
> >>> 
> >>> On 23 Aug 2016, at 11:36, Trond Myklebust wrote:
> >>> 
> >>>>> On Aug 23, 2016, at 11:09, Benjamin Coddington <bcodding@redhat.com> wrote:
> >>>>> 
> >>>>> Hi linux-nfs,
> >>>>> 
> >>>>> 311324ad1713 ("NFS: Be more aggressive in using readdirplus for 'ls -l'
> >>>>> situations") changed when nfs_readdir() decides to revalidate the
> >>>>> directory's mapping, which contains all the entries.  In addition to just
> >>>>> checking if the attribute cache has expired, it includes a check to see if
> >>>>> NFS_INO_INVALID_DATA is set on the directory.
> >>>>> 
> >>>>> Well, customers that have directories with very many dentries and that same
> >>>>> directory's attributes are frequently updated are now grouchy that `ls -l`
> >>>>> takes so long since any update of the directory causes the mapping to be
> >>>>> invalidated and we have to start over filling the directory's mapping.
> >>>>> 
> >>>>> I actually haven't put real hard thought into it yet (because often for me
> >>>>> that just wastes a lot of time), so I am doing the lazy thing by asking this
> >>>>> question:
> >>>>> 
> >>>>> Can we go back to just the using the attribute cache timeout, or should we
> >>>>> get all heuristical about readdir?
> >>>>> 
> >>>> 
> >>>> We are all heuristical at this point. How are the heuristics failing?
> >>>> 
> >>>> The original problem those heuristics were designed to solve was that all
> >>>> the stat() calls took forever to complete, since they are all synchronous;
> >>>> Tigran showed some very convincing numbers for a large directory where the
> >>>> difference in performance was an order of magnitude improved by using
> >>>> readdirplus instead of readdir…
> >>> 
> >>> I'll try to present a better explanation.  While `ls -l` is walking through
> >>> a directory repeatedly entering nfs_readdir(), a CREATE response send us
> >>> through nfs_post_op_update_inode_locked():
> >>> 
> >>> 1531     if (S_ISDIR(inode->i_mode))
> >>> 1532         invalid |= NFS_INO_INVALID_DATA;
> >>> 1533     nfs_set_cache_invalid(inode, invalid);
> >>> 
> >>> Now, the next entry into nfs_readdir() has us do nfs_revalidate_mapping(),
> >>> which will do nfs_invalidate_mapping() for the directory, and so we have to
> >>> start over with cookie 0 sending READDIRPLUS to re-fill the directory's
> >>> mapping to get back to where we are for the current nfs_readdir().
> >>> 
> >>> This process repeats for every entry into nfs_readdir() if the directory
> >>> keeps getting updated, and it becomes more likely that it will be updated as
> >>> each pass takes longer and longer to re-fill the mapping as the current
> >>> nfs_readdir() invocation is further along.
> >>> 
> >>> READDIRPLUS isn't the problem, the problem is invalidating the directory
> >>> mapping in the middle of a series of getdents() if we do a CREATE.  Also, I
> >>> think a similar thing happens if the directory's mtime or ctime is updated -
> >>> but in that case we set NFS_INO_INVALID_DATA because the change_attr
> >>> updates.
> >>> 
> >>> So, for directories with a large number of entries that updates often, it can
> >>> be very slow to list the directory.
> >>> 
> >>> Why did 311324ad1713 change nfs_readdir from
> >>> 
> >>> if (nfs_attribute_cache_expired(inode))
> >>>   nfs_revalidate_mapping(inode, file->f_mapping);
> >>> to
> >>> 
> >>> if (nfs_attribute_cache_expired(inode) || nfsi->cache_validity & NFS_INO_INVALID_DATA)
> >>>   nfs_revalidate_mapping(inode, file->f_mapping);
> >> 
> >> As the commit message says, the whole purpose was to use READDIRPLUS as a substitute for multiple GETATTR calls when the heuristic tells us that the user is performing an ‘ls -l’ style of workload.
> >> 
> >>> 
> >>> .. and can we go back to the way it was before?
> >> 
> >> Not without slowing down ‘ls -l’ on large directories.
> >> 
> >>> 
> >>> OK.. I understand why -- it is more correct since if we know the directory has
> >>> changed, we might as well fetch the change.  Otherwise, we might be creating
> >>> files and then wondering why they aren't listed.
> >>> 
> >>> It might be nicer to not invalidate the mapping we're currently using for
> >>> readdir, though.  Maybe there's a way to keep the mapping for the currently
> >>> opened directory and invalidate it once it's closed.
> >> 
> > 
> >> POSIX requires that you revalidate on opendir() and rewinddir(), and
> >> leaves behaviour w.r.t. file addition and removal after the call to
> >> opendir()/rewinddir() undefined
> >> (http://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html).
> > 
> > It is only undefined whether the added or removed entry is returned.
> > Other entries still need to returned exactly once.
> > 
> > In this case we're restarting the read of the directory from scratch--I
> > don't understand how that's possible while avoiding skipped or
> > duplicated entries.
> > 
> > Surely the only safe thing to do is to continue reading using the last
> > cookie returned from the server.
> 
> Why? The client should be able to restart using any cookie at any time, and we rely on the cookies being unique to each entry. If you want more relaxed cookie semantics then be prepared to have to set up a stateful NFS readdir protocol.

I may be misunderstanding what the client's doing.  If it's doing
something like this:

	- return first hundred entries to user
	- throw out cache, restart readdir with cookie 0.
	- skip first hundred entries, return entries 101-200 to user.

then it's risking skipped or duplicate entries.

I don't see how to avoid that unless you use the cookie returned after
entry 100 to read the subsequent entries.

--b.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: I can't get no readdir satisfaction
  2016-08-24 14:20           ` Fields Bruce James
@ 2016-08-24 14:26             ` Trond Myklebust
  2016-08-24 14:40               ` J. Bruce Fields
  0 siblings, 1 reply; 19+ messages in thread
From: Trond Myklebust @ 2016-08-24 14:26 UTC (permalink / raw)
  To: Fields Bruce James; +Cc: Coddington Benjamin, List Linux NFS Mailing

DQo+IE9uIEF1ZyAyNCwgMjAxNiwgYXQgMTA6MjAsIEZpZWxkcyBCcnVjZSBKYW1lcyA8YmZpZWxk
c0BmaWVsZHNlcy5vcmc+IHdyb3RlOg0KPiANCj4gT24gV2VkLCBBdWcgMjQsIDIwMTYgYXQgMDI6
MDI6MDRQTSArMDAwMCwgVHJvbmQgTXlrbGVidXN0IHdyb3RlOg0KPj4gDQo+Pj4gT24gQXVnIDI0
LCAyMDE2LCBhdCAwOTo1NiwgSi4gQnJ1Y2UgRmllbGRzIDxiZmllbGRzQGZpZWxkc2VzLm9yZz4g
d3JvdGU6DQo+Pj4gDQo+Pj4gT24gV2VkLCBBdWcgMjQsIDIwMTYgYXQgMTI6MTg6MDRQTSArMDAw
MCwgVHJvbmQgTXlrbGVidXN0IHdyb3RlOg0KPj4+PiANCj4+Pj4+IE9uIEF1ZyAyMywgMjAxNiwg
YXQgMTc6MjEsIEJlbmphbWluIENvZGRpbmd0b24gPGJjb2RkaW5nQHJlZGhhdC5jb20+IHdyb3Rl
Og0KPj4+Pj4gDQo+Pj4+PiANCj4+Pj4+IA0KPj4+Pj4gT24gMjMgQXVnIDIwMTYsIGF0IDExOjM2
LCBUcm9uZCBNeWtsZWJ1c3Qgd3JvdGU6DQo+Pj4+PiANCj4+Pj4+Pj4gT24gQXVnIDIzLCAyMDE2
LCBhdCAxMTowOSwgQmVuamFtaW4gQ29kZGluZ3RvbiA8YmNvZGRpbmdAcmVkaGF0LmNvbT4gd3Jv
dGU6DQo+Pj4+Pj4+IA0KPj4+Pj4+PiBIaSBsaW51eC1uZnMsDQo+Pj4+Pj4+IA0KPj4+Pj4+PiAz
MTEzMjRhZDE3MTMgKCJORlM6IEJlIG1vcmUgYWdncmVzc2l2ZSBpbiB1c2luZyByZWFkZGlycGx1
cyBmb3IgJ2xzIC1sJw0KPj4+Pj4+PiBzaXR1YXRpb25zIikgY2hhbmdlZCB3aGVuIG5mc19yZWFk
ZGlyKCkgZGVjaWRlcyB0byByZXZhbGlkYXRlIHRoZQ0KPj4+Pj4+PiBkaXJlY3RvcnkncyBtYXBw
aW5nLCB3aGljaCBjb250YWlucyBhbGwgdGhlIGVudHJpZXMuICBJbiBhZGRpdGlvbiB0byBqdXN0
DQo+Pj4+Pj4+IGNoZWNraW5nIGlmIHRoZSBhdHRyaWJ1dGUgY2FjaGUgaGFzIGV4cGlyZWQsIGl0
IGluY2x1ZGVzIGEgY2hlY2sgdG8gc2VlIGlmDQo+Pj4+Pj4+IE5GU19JTk9fSU5WQUxJRF9EQVRB
IGlzIHNldCBvbiB0aGUgZGlyZWN0b3J5Lg0KPj4+Pj4+PiANCj4+Pj4+Pj4gV2VsbCwgY3VzdG9t
ZXJzIHRoYXQgaGF2ZSBkaXJlY3RvcmllcyB3aXRoIHZlcnkgbWFueSBkZW50cmllcyBhbmQgdGhh
dCBzYW1lDQo+Pj4+Pj4+IGRpcmVjdG9yeSdzIGF0dHJpYnV0ZXMgYXJlIGZyZXF1ZW50bHkgdXBk
YXRlZCBhcmUgbm93IGdyb3VjaHkgdGhhdCBgbHMgLWxgDQo+Pj4+Pj4+IHRha2VzIHNvIGxvbmcg
c2luY2UgYW55IHVwZGF0ZSBvZiB0aGUgZGlyZWN0b3J5IGNhdXNlcyB0aGUgbWFwcGluZyB0byBi
ZQ0KPj4+Pj4+PiBpbnZhbGlkYXRlZCBhbmQgd2UgaGF2ZSB0byBzdGFydCBvdmVyIGZpbGxpbmcg
dGhlIGRpcmVjdG9yeSdzIG1hcHBpbmcuDQo+Pj4+Pj4+IA0KPj4+Pj4+PiBJIGFjdHVhbGx5IGhh
dmVuJ3QgcHV0IHJlYWwgaGFyZCB0aG91Z2h0IGludG8gaXQgeWV0IChiZWNhdXNlIG9mdGVuIGZv
ciBtZQ0KPj4+Pj4+PiB0aGF0IGp1c3Qgd2FzdGVzIGEgbG90IG9mIHRpbWUpLCBzbyBJIGFtIGRv
aW5nIHRoZSBsYXp5IHRoaW5nIGJ5IGFza2luZyB0aGlzDQo+Pj4+Pj4+IHF1ZXN0aW9uOg0KPj4+
Pj4+PiANCj4+Pj4+Pj4gQ2FuIHdlIGdvIGJhY2sgdG8ganVzdCB0aGUgdXNpbmcgdGhlIGF0dHJp
YnV0ZSBjYWNoZSB0aW1lb3V0LCBvciBzaG91bGQgd2UNCj4+Pj4+Pj4gZ2V0IGFsbCBoZXVyaXN0
aWNhbCBhYm91dCByZWFkZGlyPw0KPj4+Pj4+PiANCj4+Pj4+PiANCj4+Pj4+PiBXZSBhcmUgYWxs
IGhldXJpc3RpY2FsIGF0IHRoaXMgcG9pbnQuIEhvdyBhcmUgdGhlIGhldXJpc3RpY3MgZmFpbGlu
Zz8NCj4+Pj4+PiANCj4+Pj4+PiBUaGUgb3JpZ2luYWwgcHJvYmxlbSB0aG9zZSBoZXVyaXN0aWNz
IHdlcmUgZGVzaWduZWQgdG8gc29sdmUgd2FzIHRoYXQgYWxsDQo+Pj4+Pj4gdGhlIHN0YXQoKSBj
YWxscyB0b29rIGZvcmV2ZXIgdG8gY29tcGxldGUsIHNpbmNlIHRoZXkgYXJlIGFsbCBzeW5jaHJv
bm91czsNCj4+Pj4+PiBUaWdyYW4gc2hvd2VkIHNvbWUgdmVyeSBjb252aW5jaW5nIG51bWJlcnMg
Zm9yIGEgbGFyZ2UgZGlyZWN0b3J5IHdoZXJlIHRoZQ0KPj4+Pj4+IGRpZmZlcmVuY2UgaW4gcGVy
Zm9ybWFuY2Ugd2FzIGFuIG9yZGVyIG9mIG1hZ25pdHVkZSBpbXByb3ZlZCBieSB1c2luZw0KPj4+
Pj4+IHJlYWRkaXJwbHVzIGluc3RlYWQgb2YgcmVhZGRpcuKApg0KPj4+Pj4gDQo+Pj4+PiBJJ2xs
IHRyeSB0byBwcmVzZW50IGEgYmV0dGVyIGV4cGxhbmF0aW9uLiAgV2hpbGUgYGxzIC1sYCBpcyB3
YWxraW5nIHRocm91Z2gNCj4+Pj4+IGEgZGlyZWN0b3J5IHJlcGVhdGVkbHkgZW50ZXJpbmcgbmZz
X3JlYWRkaXIoKSwgYSBDUkVBVEUgcmVzcG9uc2Ugc2VuZCB1cw0KPj4+Pj4gdGhyb3VnaCBuZnNf
cG9zdF9vcF91cGRhdGVfaW5vZGVfbG9ja2VkKCk6DQo+Pj4+PiANCj4+Pj4+IDE1MzEgICAgIGlm
IChTX0lTRElSKGlub2RlLT5pX21vZGUpKQ0KPj4+Pj4gMTUzMiAgICAgICAgIGludmFsaWQgfD0g
TkZTX0lOT19JTlZBTElEX0RBVEE7DQo+Pj4+PiAxNTMzICAgICBuZnNfc2V0X2NhY2hlX2ludmFs
aWQoaW5vZGUsIGludmFsaWQpOw0KPj4+Pj4gDQo+Pj4+PiBOb3csIHRoZSBuZXh0IGVudHJ5IGlu
dG8gbmZzX3JlYWRkaXIoKSBoYXMgdXMgZG8gbmZzX3JldmFsaWRhdGVfbWFwcGluZygpLA0KPj4+
Pj4gd2hpY2ggd2lsbCBkbyBuZnNfaW52YWxpZGF0ZV9tYXBwaW5nKCkgZm9yIHRoZSBkaXJlY3Rv
cnksIGFuZCBzbyB3ZSBoYXZlIHRvDQo+Pj4+PiBzdGFydCBvdmVyIHdpdGggY29va2llIDAgc2Vu
ZGluZyBSRUFERElSUExVUyB0byByZS1maWxsIHRoZSBkaXJlY3Rvcnkncw0KPj4+Pj4gbWFwcGlu
ZyB0byBnZXQgYmFjayB0byB3aGVyZSB3ZSBhcmUgZm9yIHRoZSBjdXJyZW50IG5mc19yZWFkZGly
KCkuDQo+Pj4+PiANCj4+Pj4+IFRoaXMgcHJvY2VzcyByZXBlYXRzIGZvciBldmVyeSBlbnRyeSBp
bnRvIG5mc19yZWFkZGlyKCkgaWYgdGhlIGRpcmVjdG9yeQ0KPj4+Pj4ga2VlcHMgZ2V0dGluZyB1
cGRhdGVkLCBhbmQgaXQgYmVjb21lcyBtb3JlIGxpa2VseSB0aGF0IGl0IHdpbGwgYmUgdXBkYXRl
ZCBhcw0KPj4+Pj4gZWFjaCBwYXNzIHRha2VzIGxvbmdlciBhbmQgbG9uZ2VyIHRvIHJlLWZpbGwg
dGhlIG1hcHBpbmcgYXMgdGhlIGN1cnJlbnQNCj4+Pj4+IG5mc19yZWFkZGlyKCkgaW52b2NhdGlv
biBpcyBmdXJ0aGVyIGFsb25nLg0KPj4+Pj4gDQo+Pj4+PiBSRUFERElSUExVUyBpc24ndCB0aGUg
cHJvYmxlbSwgdGhlIHByb2JsZW0gaXMgaW52YWxpZGF0aW5nIHRoZSBkaXJlY3RvcnkNCj4+Pj4+
IG1hcHBpbmcgaW4gdGhlIG1pZGRsZSBvZiBhIHNlcmllcyBvZiBnZXRkZW50cygpIGlmIHdlIGRv
IGEgQ1JFQVRFLiAgQWxzbywgSQ0KPj4+Pj4gdGhpbmsgYSBzaW1pbGFyIHRoaW5nIGhhcHBlbnMg
aWYgdGhlIGRpcmVjdG9yeSdzIG10aW1lIG9yIGN0aW1lIGlzIHVwZGF0ZWQgLQ0KPj4+Pj4gYnV0
IGluIHRoYXQgY2FzZSB3ZSBzZXQgTkZTX0lOT19JTlZBTElEX0RBVEEgYmVjYXVzZSB0aGUgY2hh
bmdlX2F0dHINCj4+Pj4+IHVwZGF0ZXMuDQo+Pj4+PiANCj4+Pj4+IFNvLCBmb3IgZGlyZWN0b3Jp
ZXMgd2l0aCBhIGxhcmdlIG51bWJlciBvZiBlbnRyaWVzIHRoYXQgdXBkYXRlcyBvZnRlbiwgaXQg
Y2FuDQo+Pj4+PiBiZSB2ZXJ5IHNsb3cgdG8gbGlzdCB0aGUgZGlyZWN0b3J5Lg0KPj4+Pj4gDQo+
Pj4+PiBXaHkgZGlkIDMxMTMyNGFkMTcxMyBjaGFuZ2UgbmZzX3JlYWRkaXIgZnJvbQ0KPj4+Pj4g
DQo+Pj4+PiBpZiAobmZzX2F0dHJpYnV0ZV9jYWNoZV9leHBpcmVkKGlub2RlKSkNCj4+Pj4+ICBu
ZnNfcmV2YWxpZGF0ZV9tYXBwaW5nKGlub2RlLCBmaWxlLT5mX21hcHBpbmcpOw0KPj4+Pj4gdG8N
Cj4+Pj4+IA0KPj4+Pj4gaWYgKG5mc19hdHRyaWJ1dGVfY2FjaGVfZXhwaXJlZChpbm9kZSkgfHwg
bmZzaS0+Y2FjaGVfdmFsaWRpdHkgJiBORlNfSU5PX0lOVkFMSURfREFUQSkNCj4+Pj4+ICBuZnNf
cmV2YWxpZGF0ZV9tYXBwaW5nKGlub2RlLCBmaWxlLT5mX21hcHBpbmcpOw0KPj4+PiANCj4+Pj4g
QXMgdGhlIGNvbW1pdCBtZXNzYWdlIHNheXMsIHRoZSB3aG9sZSBwdXJwb3NlIHdhcyB0byB1c2Ug
UkVBRERJUlBMVVMgYXMgYSBzdWJzdGl0dXRlIGZvciBtdWx0aXBsZSBHRVRBVFRSIGNhbGxzIHdo
ZW4gdGhlIGhldXJpc3RpYyB0ZWxscyB1cyB0aGF0IHRoZSB1c2VyIGlzIHBlcmZvcm1pbmcgYW4g
4oCYbHMgLWzigJkgc3R5bGUgb2Ygd29ya2xvYWQuDQo+Pj4+IA0KPj4+Pj4gDQo+Pj4+PiAuLiBh
bmQgY2FuIHdlIGdvIGJhY2sgdG8gdGhlIHdheSBpdCB3YXMgYmVmb3JlPw0KPj4+PiANCj4+Pj4g
Tm90IHdpdGhvdXQgc2xvd2luZyBkb3duIOKAmGxzIC1s4oCZIG9uIGxhcmdlIGRpcmVjdG9yaWVz
Lg0KPj4+PiANCj4+Pj4+IA0KPj4+Pj4gT0suLiBJIHVuZGVyc3RhbmQgd2h5IC0tIGl0IGlzIG1v
cmUgY29ycmVjdCBzaW5jZSBpZiB3ZSBrbm93IHRoZSBkaXJlY3RvcnkgaGFzDQo+Pj4+PiBjaGFu
Z2VkLCB3ZSBtaWdodCBhcyB3ZWxsIGZldGNoIHRoZSBjaGFuZ2UuICBPdGhlcndpc2UsIHdlIG1p
Z2h0IGJlIGNyZWF0aW5nDQo+Pj4+PiBmaWxlcyBhbmQgdGhlbiB3b25kZXJpbmcgd2h5IHRoZXkg
YXJlbid0IGxpc3RlZC4NCj4+Pj4+IA0KPj4+Pj4gSXQgbWlnaHQgYmUgbmljZXIgdG8gbm90IGlu
dmFsaWRhdGUgdGhlIG1hcHBpbmcgd2UncmUgY3VycmVudGx5IHVzaW5nIGZvcg0KPj4+Pj4gcmVh
ZGRpciwgdGhvdWdoLiAgTWF5YmUgdGhlcmUncyBhIHdheSB0byBrZWVwIHRoZSBtYXBwaW5nIGZv
ciB0aGUgY3VycmVudGx5DQo+Pj4+PiBvcGVuZWQgZGlyZWN0b3J5IGFuZCBpbnZhbGlkYXRlIGl0
IG9uY2UgaXQncyBjbG9zZWQuDQo+Pj4+IA0KPj4+IA0KPj4+PiBQT1NJWCByZXF1aXJlcyB0aGF0
IHlvdSByZXZhbGlkYXRlIG9uIG9wZW5kaXIoKSBhbmQgcmV3aW5kZGlyKCksIGFuZA0KPj4+PiBs
ZWF2ZXMgYmVoYXZpb3VyIHcuci50LiBmaWxlIGFkZGl0aW9uIGFuZCByZW1vdmFsIGFmdGVyIHRo
ZSBjYWxsIHRvDQo+Pj4+IG9wZW5kaXIoKS9yZXdpbmRkaXIoKSB1bmRlZmluZWQNCj4+Pj4gKGh0
dHA6Ly9wdWJzLm9wZW5ncm91cC5vcmcvb25saW5lcHVicy85Njk5OTE5Nzk5L2Z1bmN0aW9ucy9y
ZWFkZGlyLmh0bWwpLg0KPj4+IA0KPj4+IEl0IGlzIG9ubHkgdW5kZWZpbmVkIHdoZXRoZXIgdGhl
IGFkZGVkIG9yIHJlbW92ZWQgZW50cnkgaXMgcmV0dXJuZWQuDQo+Pj4gT3RoZXIgZW50cmllcyBz
dGlsbCBuZWVkIHRvIHJldHVybmVkIGV4YWN0bHkgb25jZS4NCj4+PiANCj4+PiBJbiB0aGlzIGNh
c2Ugd2UncmUgcmVzdGFydGluZyB0aGUgcmVhZCBvZiB0aGUgZGlyZWN0b3J5IGZyb20gc2NyYXRj
aC0tSQ0KPj4+IGRvbid0IHVuZGVyc3RhbmQgaG93IHRoYXQncyBwb3NzaWJsZSB3aGlsZSBhdm9p
ZGluZyBza2lwcGVkIG9yDQo+Pj4gZHVwbGljYXRlZCBlbnRyaWVzLg0KPj4+IA0KPj4+IFN1cmVs
eSB0aGUgb25seSBzYWZlIHRoaW5nIHRvIGRvIGlzIHRvIGNvbnRpbnVlIHJlYWRpbmcgdXNpbmcg
dGhlIGxhc3QNCj4+PiBjb29raWUgcmV0dXJuZWQgZnJvbSB0aGUgc2VydmVyLg0KPj4gDQo+PiBX
aHk/IFRoZSBjbGllbnQgc2hvdWxkIGJlIGFibGUgdG8gcmVzdGFydCB1c2luZyBhbnkgY29va2ll
IGF0IGFueSB0aW1lLCBhbmQgd2UgcmVseSBvbiB0aGUgY29va2llcyBiZWluZyB1bmlxdWUgdG8g
ZWFjaCBlbnRyeS4gSWYgeW91IHdhbnQgbW9yZSByZWxheGVkIGNvb2tpZSBzZW1hbnRpY3MgdGhl
biBiZSBwcmVwYXJlZCB0byBoYXZlIHRvIHNldCB1cCBhIHN0YXRlZnVsIE5GUyByZWFkZGlyIHBy
b3RvY29sLg0KPiANCj4gSSBtYXkgYmUgbWlzdW5kZXJzdGFuZGluZyB3aGF0IHRoZSBjbGllbnQn
cyBkb2luZy4gIElmIGl0J3MgZG9pbmcNCj4gc29tZXRoaW5nIGxpa2UgdGhpczoNCj4gDQo+IAkt
IHJldHVybiBmaXJzdCBodW5kcmVkIGVudHJpZXMgdG8gdXNlcg0KPiAJLSB0aHJvdyBvdXQgY2Fj
aGUsIHJlc3RhcnQgcmVhZGRpciB3aXRoIGNvb2tpZSAwLg0KPiAJLSBza2lwIGZpcnN0IGh1bmRy
ZWQgZW50cmllcywgcmV0dXJuIGVudHJpZXMgMTAxLTIwMCB0byB1c2VyLg0KDQpOby4gSWYgdGhl
IGNsaWVudCBoYXMgYSBjYWNoZWQgY29va2llIGFjdGluZyBhcyB0aGUgY3Vyc29yLCB0aGVuIHdl
IGxvb2sgZm9yIHRoYXQgY29va2llIGluIHRoZSBuZXcgc3RyZWFtLiBJZiB3ZSBkb27igJl0IGZp
bmQgdGhhdCBjb29raWUsIHRoZW4gd2UgZG8gYW5vdGhlciByZWFkZGlyIHdoZXJlIHRoYXQgY29v
a2llIGFjdHMgYXMgYW4gYXJndW1lbnQsIGFuZCB1c2UgdGhhdCB0byBmaWd1cmUgb3V0IHdoZXJl
IHRoZSBzdHJlYW0gbm93IGNvbnRpbnVlcy4NCg0KDQo+IA0KPiB0aGVuIGl0J3Mgcmlza2luZyBz
a2lwcGVkIG9yIGR1cGxpY2F0ZSBlbnRyaWVzLg0KPiANCj4gSSBkb24ndCBzZWUgaG93IHRvIGF2
b2lkIHRoYXQgdW5sZXNzIHlvdSB1c2UgdGhlIGNvb2tpZSByZXR1cm5lZCBhZnRlcg0KPiBlbnRy
eSAxMDAgdG8gcmVhZCB0aGUgc3Vic2VxdWVudCBlbnRyaWVzLg0KPiANCg0KU2VlIGFib3ZlDQoN
Cg==

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: I can't get no readdir satisfaction
  2016-08-24 14:26             ` Trond Myklebust
@ 2016-08-24 14:40               ` J. Bruce Fields
  2016-08-24 14:53                 ` Trond Myklebust
  0 siblings, 1 reply; 19+ messages in thread
From: J. Bruce Fields @ 2016-08-24 14:40 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Coddington Benjamin, List Linux NFS Mailing

On Wed, Aug 24, 2016 at 02:26:21PM +0000, Trond Myklebust wrote:
> 
> > On Aug 24, 2016, at 10:20, <bfields@fieldses.org> wrote:
> > I may be misunderstanding what the client's doing.  If it's doing
> > something like this:
> > 
> > 	- return first hundred entries to user
> > 	- throw out cache, restart readdir with cookie 0.
> > 	- skip first hundred entries, return entries 101-200 to user.
> 
> No. If the client has a cached cookie acting as the cursor, then we look for that cookie in the new stream. If we don’t find that cookie, then we do another readdir where that cookie acts as an argument, and use that to figure out where the stream now continues.

Got it, thanks, so the struct file caches the real server side cookie
even after you throw out the cached entries?  OK, I guess I need to make
another attempt at understanding the actual code, apologies.

(Doesn't that mean that in the case your last position was a deleted
file, you're probably going to have to read the entire directory to find
out the cookie's not there?  It seems so much more painful than starting
off with the last cookie.)

--b.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: I can't get no readdir satisfaction
  2016-08-24 14:40               ` J. Bruce Fields
@ 2016-08-24 14:53                 ` Trond Myklebust
  2016-08-24 15:16                   ` Fields Bruce James
  0 siblings, 1 reply; 19+ messages in thread
From: Trond Myklebust @ 2016-08-24 14:53 UTC (permalink / raw)
  To: Fields Bruce James; +Cc: Coddington Benjamin, List Linux NFS Mailing

DQo+IE9uIEF1ZyAyNCwgMjAxNiwgYXQgMTA6NDAsIEouIEJydWNlIEZpZWxkcyA8YmZpZWxkc0Bm
aWVsZHNlcy5vcmc+IHdyb3RlOg0KPiANCj4gT24gV2VkLCBBdWcgMjQsIDIwMTYgYXQgMDI6MjY6
MjFQTSArMDAwMCwgVHJvbmQgTXlrbGVidXN0IHdyb3RlOg0KPj4gDQo+Pj4gT24gQXVnIDI0LCAy
MDE2LCBhdCAxMDoyMCwgPGJmaWVsZHNAZmllbGRzZXMub3JnPiB3cm90ZToNCj4+PiBJIG1heSBi
ZSBtaXN1bmRlcnN0YW5kaW5nIHdoYXQgdGhlIGNsaWVudCdzIGRvaW5nLiAgSWYgaXQncyBkb2lu
Zw0KPj4+IHNvbWV0aGluZyBsaWtlIHRoaXM6DQo+Pj4gDQo+Pj4gCS0gcmV0dXJuIGZpcnN0IGh1
bmRyZWQgZW50cmllcyB0byB1c2VyDQo+Pj4gCS0gdGhyb3cgb3V0IGNhY2hlLCByZXN0YXJ0IHJl
YWRkaXIgd2l0aCBjb29raWUgMC4NCj4+PiAJLSBza2lwIGZpcnN0IGh1bmRyZWQgZW50cmllcywg
cmV0dXJuIGVudHJpZXMgMTAxLTIwMCB0byB1c2VyLg0KPj4gDQo+PiBOby4gSWYgdGhlIGNsaWVu
dCBoYXMgYSBjYWNoZWQgY29va2llIGFjdGluZyBhcyB0aGUgY3Vyc29yLCB0aGVuIHdlIGxvb2sg
Zm9yIHRoYXQgY29va2llIGluIHRoZSBuZXcgc3RyZWFtLiBJZiB3ZSBkb27igJl0IGZpbmQgdGhh
dCBjb29raWUsIHRoZW4gd2UgZG8gYW5vdGhlciByZWFkZGlyIHdoZXJlIHRoYXQgY29va2llIGFj
dHMgYXMgYW4gYXJndW1lbnQsIGFuZCB1c2UgdGhhdCB0byBmaWd1cmUgb3V0IHdoZXJlIHRoZSBz
dHJlYW0gbm93IGNvbnRpbnVlcy4NCj4gDQo+IEdvdCBpdCwgdGhhbmtzLCBzbyB0aGUgc3RydWN0
IGZpbGUgY2FjaGVzIHRoZSByZWFsIHNlcnZlciBzaWRlIGNvb2tpZQ0KPiBldmVuIGFmdGVyIHlv
dSB0aHJvdyBvdXQgdGhlIGNhY2hlZCBlbnRyaWVzPyAgT0ssIEkgZ3Vlc3MgSSBuZWVkIHRvIG1h
a2UNCj4gYW5vdGhlciBhdHRlbXB0IGF0IHVuZGVyc3RhbmRpbmcgdGhlIGFjdHVhbCBjb2RlLCBh
cG9sb2dpZXMuDQo+IA0KPiAoRG9lc24ndCB0aGF0IG1lYW4gdGhhdCBpbiB0aGUgY2FzZSB5b3Vy
IGxhc3QgcG9zaXRpb24gd2FzIGEgZGVsZXRlZA0KPiBmaWxlLCB5b3UncmUgcHJvYmFibHkgZ29p
bmcgdG8gaGF2ZSB0byByZWFkIHRoZSBlbnRpcmUgZGlyZWN0b3J5IHRvIGZpbmQNCj4gb3V0IHRo
ZSBjb29raWUncyBub3QgdGhlcmU/ICBJdCBzZWVtcyBzbyBtdWNoIG1vcmUgcGFpbmZ1bCB0aGFu
IHN0YXJ0aW5nDQo+IG9mZiB3aXRoIHRoZSBsYXN0IGNvb2tpZS4pDQoNCkhvdyBkbyB5b3UgY2Fj
aGUgdGhlIHJlc3VsdHMgb2Ygc3VjaCBhIHJlYWRkaXI/IFlvdSdsbCBoYXZlIG5vIHJlZmVyZW5j
ZSBwb2ludCB0byBmaWd1cmUgb3V0IHdoZXJlIHlvdSBhcmUgaW4gdGhlIHN0cmVhbS4NCkFsc28s
IGhvdyBkbyB5b3UgY29udmVydCB0aGF0IGNvb2tpZSBpbnRvIGFuIG9mZnNldCB0aGF0IHRlbGxk
aXIoKSBjYW4gdXNlPw0KDQoNCg==

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: I can't get no readdir satisfaction
  2016-08-24 14:53                 ` Trond Myklebust
@ 2016-08-24 15:16                   ` Fields Bruce James
  0 siblings, 0 replies; 19+ messages in thread
From: Fields Bruce James @ 2016-08-24 15:16 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Coddington Benjamin, List Linux NFS Mailing

On Wed, Aug 24, 2016 at 02:53:10PM +0000, Trond Myklebust wrote:
> 
> > On Aug 24, 2016, at 10:40, J. Bruce Fields <bfields@fieldses.org> wrote:
> > 
> > On Wed, Aug 24, 2016 at 02:26:21PM +0000, Trond Myklebust wrote:
> >> 
> >>> On Aug 24, 2016, at 10:20, <bfields@fieldses.org> wrote:
> >>> I may be misunderstanding what the client's doing.  If it's doing
> >>> something like this:
> >>> 
> >>> 	- return first hundred entries to user
> >>> 	- throw out cache, restart readdir with cookie 0.
> >>> 	- skip first hundred entries, return entries 101-200 to user.
> >> 
> >> No. If the client has a cached cookie acting as the cursor, then we look for that cookie in the new stream. If we don’t find that cookie, then we do another readdir where that cookie acts as an argument, and use that to figure out where the stream now continues.
> > 
> > Got it, thanks, so the struct file caches the real server side cookie
> > even after you throw out the cached entries?  OK, I guess I need to make
> > another attempt at understanding the actual code, apologies.
> > 
> > (Doesn't that mean that in the case your last position was a deleted
> > file, you're probably going to have to read the entire directory to find
> > out the cookie's not there?  It seems so much more painful than starting
> > off with the last cookie.)
> 
> How do you cache the results of such a readdir? You'll have no reference point to figure out where you are in the stream.
> Also, how do you convert that cookie into an offset that telldir() can use?

No idea.  But then, I don't really understand how we're going to do that
even after reading the whole directory.  Any telldir offsets that we've
returned previously are in the future going to index into a completely
new array, so they seem kind of meaningless.

--b.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: I can't get no readdir satisfaction
  2016-08-24 14:19             ` Trond Myklebust
@ 2016-08-24 15:18               ` Benjamin Coddington
  0 siblings, 0 replies; 19+ messages in thread
From: Benjamin Coddington @ 2016-08-24 15:18 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Fields Bruce James, List Linux NFS Mailing


On 24 Aug 2016, at 10:19, Trond Myklebust wrote:

>> On Aug 24, 2016, at 10:16, Benjamin Coddington <bcodding@redhat.com> 
>> wrote:
>>
>> On 24 Aug 2016, at 10:02, Trond Myklebust wrote:
>>
>>>> On Aug 24, 2016, at 09:56, J. Bruce Fields <bfields@fieldses.org> 
>>>> wrote:
>>>>
>>>> On Wed, Aug 24, 2016 at 12:18:04PM +0000, Trond Myklebust wrote:
>>>>>
>>>>>> On Aug 23, 2016, at 17:21, Benjamin Coddington 
>>>>>> <bcodding@redhat.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 23 Aug 2016, at 11:36, Trond Myklebust wrote:
>>>>>>
>>>>>>>> On Aug 23, 2016, at 11:09, Benjamin Coddington 
>>>>>>>> <bcodding@redhat.com> wrote:
>>>>>>>>
>>>>>>>> Hi linux-nfs,
>>>>>>>>
>>>>>>>> 311324ad1713 ("NFS: Be more aggressive in using readdirplus for 
>>>>>>>> 'ls -l'
>>>>>>>> situations") changed when nfs_readdir() decides to revalidate 
>>>>>>>> the
>>>>>>>> directory's mapping, which contains all the entries.  In 
>>>>>>>> addition to just
>>>>>>>> checking if the attribute cache has expired, it includes a 
>>>>>>>> check to see if
>>>>>>>> NFS_INO_INVALID_DATA is set on the directory.
>>>>>>>>
>>>>>>>> Well, customers that have directories with very many dentries 
>>>>>>>> and that same
>>>>>>>> directory's attributes are frequently updated are now grouchy 
>>>>>>>> that `ls -l`
>>>>>>>> takes so long since any update of the directory causes the 
>>>>>>>> mapping to be
>>>>>>>> invalidated and we have to start over filling the directory's 
>>>>>>>> mapping.
>>>>>>>>
>>>>>>>> I actually haven't put real hard thought into it yet (because 
>>>>>>>> often for me
>>>>>>>> that just wastes a lot of time), so I am doing the lazy thing 
>>>>>>>> by asking this
>>>>>>>> question:
>>>>>>>>
>>>>>>>> Can we go back to just the using the attribute cache timeout, 
>>>>>>>> or should we
>>>>>>>> get all heuristical about readdir?
>>>>>>>>
>>>>>>>
>>>>>>> We are all heuristical at this point. How are the heuristics 
>>>>>>> failing?
>>>>>>>
>>>>>>> The original problem those heuristics were designed to solve was 
>>>>>>> that all
>>>>>>> the stat() calls took forever to complete, since they are all 
>>>>>>> synchronous;
>>>>>>> Tigran showed some very convincing numbers for a large directory 
>>>>>>> where the
>>>>>>> difference in performance was an order of magnitude improved by 
>>>>>>> using
>>>>>>> readdirplus instead of readdir…
>>>>>>
>>>>>> I'll try to present a better explanation.  While `ls -l` is 
>>>>>> walking through
>>>>>> a directory repeatedly entering nfs_readdir(), a CREATE response 
>>>>>> send us
>>>>>> through nfs_post_op_update_inode_locked():
>>>>>>
>>>>>> 1531     if (S_ISDIR(inode->i_mode))
>>>>>> 1532         invalid |= NFS_INO_INVALID_DATA;
>>>>>> 1533     nfs_set_cache_invalid(inode, invalid);
>>>>>>
>>>>>> Now, the next entry into nfs_readdir() has us do 
>>>>>> nfs_revalidate_mapping(),
>>>>>> which will do nfs_invalidate_mapping() for the directory, and so 
>>>>>> we have to
>>>>>> start over with cookie 0 sending READDIRPLUS to re-fill the 
>>>>>> directory's
>>>>>> mapping to get back to where we are for the current 
>>>>>> nfs_readdir().
>>>>>>
>>>>>> This process repeats for every entry into nfs_readdir() if the 
>>>>>> directory
>>>>>> keeps getting updated, and it becomes more likely that it will be 
>>>>>> updated as
>>>>>> each pass takes longer and longer to re-fill the mapping as the 
>>>>>> current
>>>>>> nfs_readdir() invocation is further along.
>>>>>>
>>>>>> READDIRPLUS isn't the problem, the problem is invalidating the 
>>>>>> directory
>>>>>> mapping in the middle of a series of getdents() if we do a 
>>>>>> CREATE.  Also, I
>>>>>> think a similar thing happens if the directory's mtime or ctime 
>>>>>> is updated -
>>>>>> but in that case we set NFS_INO_INVALID_DATA because the 
>>>>>> change_attr
>>>>>> updates.
>>>>>>
>>>>>> So, for directories with a large number of entries that updates 
>>>>>> often, it can
>>>>>> be very slow to list the directory.
>>>>>>
>>>>>> Why did 311324ad1713 change nfs_readdir from
>>>>>>
>>>>>> if (nfs_attribute_cache_expired(inode))
>>>>>>  nfs_revalidate_mapping(inode, file->f_mapping);
>>>>>> to
>>>>>>
>>>>>> if (nfs_attribute_cache_expired(inode) || nfsi->cache_validity & 
>>>>>> NFS_INO_INVALID_DATA)
>>>>>>  nfs_revalidate_mapping(inode, file->f_mapping);
>>>>>
>>>>> As the commit message says, the whole purpose was to use 
>>>>> READDIRPLUS as a substitute for multiple GETATTR calls when the 
>>>>> heuristic tells us that the user is performing an ‘ls -l’ 
>>>>> style of workload.
>>>>>
>>>>>>
>>>>>> .. and can we go back to the way it was before?
>>>>>
>>>>> Not without slowing down ‘ls -l’ on large directories.
>>>>>
>>>>>>
>>>>>> OK.. I understand why -- it is more correct since if we know the 
>>>>>> directory has
>>>>>> changed, we might as well fetch the change.  Otherwise, we might 
>>>>>> be creating
>>>>>> files and then wondering why they aren't listed.
>>>>>>
>>>>>> It might be nicer to not invalidate the mapping we're currently 
>>>>>> using for
>>>>>> readdir, though.  Maybe there's a way to keep the mapping for the 
>>>>>> currently
>>>>>> opened directory and invalidate it once it's closed.
>>>>>
>>>>
>>>>> POSIX requires that you revalidate on opendir() and rewinddir(), 
>>>>> and
>>>>> leaves behaviour w.r.t. file addition and removal after the call 
>>>>> to
>>>>> opendir()/rewinddir() undefined
>>>>> (http://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html).
>>>>
>>>> It is only undefined whether the added or removed entry is 
>>>> returned.
>>>> Other entries still need to returned exactly once.
>>>>
>>>> In this case we're restarting the read of the directory from 
>>>> scratch--I
>>>> don't understand how that's possible while avoiding skipped or
>>>> duplicated entries.
>>>>
>>>> Surely the only safe thing to do is to continue reading using the 
>>>> last
>>>> cookie returned from the server.
>>>
>>> Why? The client should be able to restart using any cookie at any 
>>> time,
>>> and we rely on the cookies being unique to each entry. If you want 
>>> more
>>> relaxed cookie semantics then be prepared to have to set up a 
>>> stateful NFS
>>> readdir protocol.
>>
>> It looks to me as if I send in a non-zero position to nfs_readdir(), 
>> but the mapping
>> has been invalidated, then the client starts over at cookie 0 sending
>> READDIRs to the server and reading the directory entries into the 
>> page cache
>> in order to figure out which position maps to which entry/cookie.
>
> Yes.
>
>> It seems possible to get duplicated or skipped entries.
>
> There is a logical step missing here. How do you get from “we can 
> end up
> re-filling the cache from scratch” to “there can be duplicate or 
> skipped
> entries”?

Because I thought we were searching for the position, not the cookie.  I
missed that we keep the last position/cookie around.  Thanks, I need to 
read
more carefully.

It seems a seekdir would clear that cookie, then we might have
duplicate/missing entries.. but this is getting far from the original 
point.
There's plenty of previous discussion on telldir/seekdir as you point 
out,
so I'll go off and read that.

Ben

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2016-08-24 15:18 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-23 15:09 I can't get no readdir satisfaction Benjamin Coddington
2016-08-23 15:36 ` Trond Myklebust
2016-08-23 21:21   ` Benjamin Coddington
2016-08-24 12:18     ` Trond Myklebust
2016-08-24 13:15       ` Benjamin Coddington
2016-08-24 13:39         ` Trond Myklebust
2016-08-24 13:56       ` J. Bruce Fields
2016-08-24 14:02         ` Trond Myklebust
2016-08-24 14:16           ` Benjamin Coddington
2016-08-24 14:19             ` Trond Myklebust
2016-08-24 15:18               ` Benjamin Coddington
2016-08-24 14:20           ` Fields Bruce James
2016-08-24 14:26             ` Trond Myklebust
2016-08-24 14:40               ` J. Bruce Fields
2016-08-24 14:53                 ` Trond Myklebust
2016-08-24 15:16                   ` Fields Bruce James
2016-08-24 13:02     ` Benjamin Coddington
2016-08-23 19:36 ` J. Bruce Fields
2016-08-23 21:50   ` Benjamin Coddington

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.