All of lore.kernel.org
 help / color / mirror / Atom feed
* [Lustre-devel] Attribute caching and OBD_CONNECT_ATTRFID
@ 2010-06-09 16:42 Ken Hornstein
  2010-06-18 21:35 ` Andreas Dilger
  0 siblings, 1 reply; 8+ messages in thread
From: Ken Hornstein @ 2010-06-09 16:42 UTC (permalink / raw)
  To: lustre-devel

So I've been working on attribute caching for Lustre for the MacOS X port,
and I've run into a bit of a snag.

A bit of a brief explanation: unlike Linux, where (as I understand it)
filesystems have to store attributes in the vnode as they change, on
BSD-derived vnode filesystems you have a getattr function that gets
called to find out the attributes of a file.  You get passed a pointer
to the vnode (a sort-of virtual inode) and in that contains a pointer
to a structure which the filesystem manages; in Lustre's case, that
contains among other things a copy of the FID.  So just to recap, when
MacOS X wants the attributes of a file, it's done via FID, not by
filename.  I am aware that you can get attributes at lookup time, and I
plan on implementing that, but I figure I'd start with this first.

It is my understanding that to properly implement attribute caching, I
need to call md_intent_lock with the right op_data and an intent
structure with it_op set to IT_GETATTR.  When those attributes change,
I'll get notified in my ast callback.  If that's wrong, then please
correct me.

First off, I see that one of the options at connect time is
OBD_CONNECT_ATTRFID, and looking at the comments and the llite code for
this option it seems pretty self explanatory - you can fetch file
attributes by FID.  Perfect!  Well, like the llite code I test to make
sure OBD_CONNECT_ATTRFID is set in the connect flags, and I discover
that it is _not_.

The reason for this is that it's commented out of the
MDT_CONNECT_SUPPORTED macro in lustre_idl.h (and that's used to mask
out the reply you get indicating what is supported).  Now, it's not
clear to me why that was done: this happened back in 2007 as part of
commit d2d56f38, and the commit entry just says, "make HEAD from
b_post_cmd3", and it's been that way ever since.

The code still seems to exist on the server, so I figure I'll give it
a try anyway,  First issue: you need to include the parent when using
CONNECT_ATTRFID; I have to confess that I don't understand why, because
you would think that it wouldn't be necessary just to get the attributes
for a file, but this in enforced on the server (unfortunately, by
kicking an LBUG).  This is a problem for the root directory since it's
the first one looked up _and_ it doesn't have a parent, but I can special
case that; for all other vnodes, the parent is easy to find (the filesystem
layer keeps track of it).

When I try doing _that_, I get an error from the lmv routines that I haven't
tracked down yet.  But the whole thing makes me wonder if perhaps I'm
doing the wrong thing.

From what I see, attributes are also cached at open/lookup time and as
part of the statahead code; I can do that relatively easy.  But if
attributes _change_, then I have to wonder what is supposed to happen
next.  Since current Lustre servers don't return OBD_CONNECT_ATTRFID as
being set anymore, the code in the client doesn't seem like it will
cache the attribute (for files, anyway) anymore.  It's entirely possible 
that I'm missing something; that happens a lot.

So, I guess the question I have is ... if you get your lock broken when
you have it on a file's attributes, exactly what is supposed to happen next?
What's the mechanism for getting it again?

--Ken

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Lustre-devel] Attribute caching and OBD_CONNECT_ATTRFID
  2010-06-09 16:42 [Lustre-devel] Attribute caching and OBD_CONNECT_ATTRFID Ken Hornstein
@ 2010-06-18 21:35 ` Andreas Dilger
  2010-06-21  2:42   ` Ken Hornstein
  0 siblings, 1 reply; 8+ messages in thread
From: Andreas Dilger @ 2010-06-18 21:35 UTC (permalink / raw)
  To: lustre-devel

On 2010-06-09, at 10:42, Ken Hornstein wrote:
> So I've been working on attribute caching for Lustre for the MacOS X port,
> and I've run into a bit of a snag.

Ken,
sorry for the delay in replying, this mail disappeared off the top of my inbox.

> A bit of a brief explanation: unlike Linux, where (as I understand it)
> filesystems have to store attributes in the vnode as they change, on
> BSD-derived vnode filesystems you have a getattr function that gets
> called to find out the attributes of a file.

Linux VFS is kind of half-way between that.  There is a "common" inode structure that has information needed for any kind of inode (owner, size, blocks, etc), and then a "private" inode structure with filesystem-specific data, pointers, etc.

>  You get passed a pointer
> to the vnode (a sort-of virtual inode) and in that contains a pointer
> to a structure which the filesystem manages; in Lustre's case, that
> contains among other things a copy of the FID.  So just to recap, when
> MacOS X wants the attributes of a file, it's done via FID, not by
> filename. I am aware that you can get attributes at lookup time, and I
> plan on implementing that, but I figure I'd start with this first.

Right, this getattr-by-FID is done when revalidating an inode, but as you write it is normally done at lookup time.

> It is my understanding that to properly implement attribute caching, I
> need to call md_intent_lock with the right op_data and an intent
> structure with it_op set to IT_GETATTR.  When those attributes change,
> I'll get notified in my ast callback.  If that's wrong, then please
> correct me.

The "intent" part of the locking is that the client is trying to write-lock the directory with the intent looking up the file to open+create it.  In all cases today the MDS does not grant the write lock on the directory, and replies to the client essentially "you can't have the directory lock, but I opened/created the file for you, and here are the attributes".

> First off, I see that one of the options at connect time is
> OBD_CONNECT_ATTRFID, and looking at the comments and the llite code for
> this option it seems pretty self explanatory - you can fetch file
> attributes by FID.  Perfect!  Well, like the llite code I test to make
> sure OBD_CONNECT_ATTRFID is set in the connect flags, and I discover
> that it is _not_.
> 
> The reason for this is that it's commented out of the
> MDT_CONNECT_SUPPORTED macro in lustre_idl.h (and that's used to mask
> out the reply you get indicating what is supported).  Now, it's not
> clear to me why that was done: this happened back in 2007 as part of
> commit d2d56f38, and the commit entry just says, "make HEAD from
> b_post_cmd3", and it's been that way ever since.

I was going to point out where this wasn't correct, but it seems this IS commented out, and it would seem to be a legitimate problem.  I wasn't involved in CMD3 so offhand I can't say either why it was changed.  I'm doing some spelunking in CVS to see if there is an explanation for it.  From what I can gather, the commit comment is:

revision 1.1.14.82.2.12
date: 2007/06/14 06:54:00;  author: tappro
- disable ATTRFID feature due to cmd restriction
  bz12718
r:huanghua,fanyong


And in bugzilla bug 12718 (among the dozens of other patches landed from that bug) there is a comment:

>> Created an attachment (id=11067)  ATTRFID disable
>> 
>> ATTRFID should be disabled in CMD. The reason for that is locking - getting attr by fid require LOOKUP|UPDATE lock (LOOKUP is for access rights attributes) but in CMD LOOKUP lock can be already taken on another MDS - on name.
>> This is old CMD issue and the solution will be provided later, it seems we need new lock (e.g. PERMISSION) to protect such attributes. For now we just disable ATTRFID support.       
> 
> But disabling ATTRFID will cause sanity 29 to fail.
> Please investigate, or find solution for sanity 29.

But this question was never answered, and the patch was landed.  To my reading, the lack of ATTRFID support would prevent attribute caching on clients connected to 2.x servers once they lose their lock - they will repeatedly to MDS_GETATTR RPCs and never be able to cache them.

Seeing as 2.0 does not support CMD, and even in 2.x when we support CMD not all systems will be configured with CMD it should be possible to enable ATTRFID if no CMD is enabled.  

> The code still seems to exist on the server, so I figure I'll give it
> a try anyway,  First issue: you need to include the parent when using
> CONNECT_ATTRFID; I have to confess that I don't understand why, because
> you would think that it wouldn't be necessary just to get the attributes
> for a file, but this in enforced on the server (unfortunately, by
> kicking an LBUG).

It is a serious defect to LBUG() on the server due to bad client data/request.  Could you please file a bug on that?

>  This is a problem for the root directory since it's
> the first one looked up _and_ it doesn't have a parent, but I can special
> case that; for all other vnodes, the parent is easy to find (the filesystem
> layer keeps track of it).

I think that is fixing the symptom and not the cause.  I'm not sure why we need the parent FID to get a lock on the child by itself, but even the client can't be 100% sure that it will have the parent.  For example, if the Lustre client is re-exporting the filesystem via NFS, then it has crashed and lost the in-memory directory hierarchy, and then gets a revalidation/getattr request from its NFS client without the parent it won't have the parent directory.  IIRC, this is the original reason we added ATTRFID support, was for NFS to be able to cache attributes for such inodes.

> When I try doing _that_, I get an error from the lmv routines that I haven't
> tracked down yet.  But the whole thing makes me wonder if perhaps I'm
> doing the wrong thing.
> 
> From what I see, attributes are also cached at open/lookup time and as
> part of the statahead code; I can do that relatively easy.  But if
> attributes _change_, then I have to wonder what is supposed to happen
> next.

If there is a lock on the attributes, the lock will be revoked and the client's attributes will no longer be valid.  Presumably, either the client will do another lookup by name and re-get the lock on that inode, or the internal revalidation will just keep on doing MDS_GETATTR RPCs and the attributes will not be cached.

> Since current Lustre servers don't return OBD_CONNECT_ATTRFID as
> being set anymore, the code in the client doesn't seem like it will
> cache the attribute (for files, anyway) anymore.  It's entirely possible 
> that I'm missing something; that happens a lot.
> 
> So, I guess the question I have is ... if you get your lock broken when
> you have it on a file's attributes, exactly what is supposed to happen next?
> What's the mechanism for getting it again?

Well, it should do a getattr-by-FID and re-fetch the lock, which is exactly what happens with 1.6/1.8 servers.  I wonder if we even have a test workload that would expose this difference, since many tests like "ls -l" will always do a stat(filename), and I suspect very few tests will keep a filehandle open and do fstat(fd) for a long enough time for the client to lose its lock on the inode.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Lustre-devel] Attribute caching and OBD_CONNECT_ATTRFID
  2010-06-18 21:35 ` Andreas Dilger
@ 2010-06-21  2:42   ` Ken Hornstein
  2010-06-21 14:34     ` Andreas Dilger
  0 siblings, 1 reply; 8+ messages in thread
From: Ken Hornstein @ 2010-06-21  2:42 UTC (permalink / raw)
  To: lustre-devel

>sorry for the delay in replying, this mail disappeared off the top of my
>inbox.

Understood; I've been caught up in other things as well.

>The "intent" part of the locking is that the client is trying to
>write-lock the directory with the intent looking up the file to
>open+create it.  In all cases today the MDS does not grant the write
>lock on the directory, and replies to the client essentially "you can't
>have the directory lock, but I opened/created the file for you, and here
>are the attributes".

Ah, okay understood (I know you have explained this before; it's all coming
back now).

>But this question was never answered, and the patch was landed.  To my
>reading, the lack of ATTRFID support would prevent attribute caching on
>clients connected to 2.x servers once they lose their lock - they will
>repeatedly to MDS_GETATTR RPCs and never be able to cache them.

Ah-HA!  Well, that's sorta what I had thought, and I figured that wasn't
right.  But I'm glad to see at least that we agree on thatpart :-/

>It is a serious defect to LBUG() on the server due to bad client
>data/request.  Could you please file a bug on that?

Sure, will do.

>I think that is fixing the symptom and not the cause.  I'm not sure why
>we need the parent FID to get a lock on the child by itself, but even
>the client can't be 100% sure that it will have the parent.  For
>example, if the Lustre client is re-exporting the filesystem via NFS,
>then it has crashed and lost the in-memory directory hierarchy, and then
>gets a revalidation/getattr request from its NFS client without the
>parent it won't have the parent directory.  IIRC, this is the original
>reason we added ATTRFID support, was for NFS to be able to cache
>attributes for such inodes.

Ahhh ... okay.  From my reading of the code, the parent directory is locked,
but then released.  Again, I'm not sure why that happens ... the code there
is a bit hairy to my eyes.

>Well, it should do a getattr-by-FID and re-fetch the lock, which is
>exactly what happens with 1.6/1.8 servers.  I wonder if we even have a
>test workload that would expose this difference, since many tests like
>"ls -l" will always do a stat(filename), and I suspect very few tests
>will keep a filehandle open and do fstat(fd) for a long enough time for
>the client to lose its lock on the inode.

Well, I guess the question is ... does a "ls -l" involve doing a
new LOOKUP on Linux?.  I was kinda hoping to avoid that on the
Lustre client; the name-to-vnode mapping is controlled by the layer
above me in vnode filesystems.  Obviously I provide my own lookup()
routine to perform that mapping, but once that is done I tell the
upper layer about the name->vnode and it is cached, and obviously
I want that caching to stick around to avoid LOOKUP rpcs.

Now here's a thought: if I get a broken lock on a vnode, I can tell the
upper layer to invalidate the name->vnode mapping (can I do that?  Yes,
it turns out I can).  That will mean lookup() will be called the next time
around.  I think in terms of RPC round-trip, a GETATTR is the same as a
LOOKUP, right?  It feels kinda wrong to do it that way, though.

--Ken

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Lustre-devel] Attribute caching and OBD_CONNECT_ATTRFID
  2010-06-21  2:42   ` Ken Hornstein
@ 2010-06-21 14:34     ` Andreas Dilger
  2010-06-21 19:17       ` Ken Hornstein
  0 siblings, 1 reply; 8+ messages in thread
From: Andreas Dilger @ 2010-06-21 14:34 UTC (permalink / raw)
  To: lustre-devel


On 2010-06-20, at 20:42, Ken Hornstein <kenh@cmf.nrl.navy.mil> wrote:
> Now here's a thought: if I get a broken lock on a vnode, I can tell  
> the
> upper layer to invalidate the name->vnode mapping (can I do that?   
> Yes,
> it turns out I can).  That will mean lookup() will be called the  
> next time
> around.  I think in terms of RPC round-trip, a GETATTR is the same  
> as a
> LOOKUP, right?  It feels kinda wrong to do it that way, though.

This is in fact the right way to do it. If the lock is cancelledq then  
the client has no way of knowing whether that name will continue to  
point at the same FID or not.

Also note we recently changed the Linux client to invalidate the inode  
when all of the DLM locks related to that inode are cancelled.  This  
ensures the client 'forgets' about deleted inodes even when another  
client deleted it. Otherwise, the client will accumulate a lot of  
unnecessary inodes in cache.

Cheers, Andreas

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Lustre-devel] Attribute caching and OBD_CONNECT_ATTRFID
  2010-06-21 14:34     ` Andreas Dilger
@ 2010-06-21 19:17       ` Ken Hornstein
  2010-06-21 21:35         ` Andreas Dilger
  0 siblings, 1 reply; 8+ messages in thread
From: Ken Hornstein @ 2010-06-21 19:17 UTC (permalink / raw)
  To: lustre-devel

>This is in fact the right way to do it. If the lock is cancelledq then  
>the client has no way of knowing whether that name will continue to  
>point at the same FID or not.

I guess I was thinking in terms of inodebits locks, you may (for example)
get the file size invalidated, but the name/FID is still the same.  Is
my understanding of inodebits locks wrong?  It's entirely possible that
it is.

>Also note we recently changed the Linux client to invalidate the inode  
>when all of the DLM locks related to that inode are cancelled.  This  
>ensures the client 'forgets' about deleted inodes even when another  
>client deleted it. Otherwise, the client will accumulate a lot of  
>unnecessary inodes in cache.

I guess I could simply revoke the vnode when all of the DLM locks for
it are cancelled (that tells the upper layer to forget everything you
knew about the vnode).  That seems ... well, I guess I can't see
anything wrong with it.  I guess now that you mention it, if another
client deletes a file and you don't have a lock on it, you're not
necessarily going to know that the delete happened, will you?
But what happens if that file that's been invalidated is in use?

Hm, actually, I see now that revoking the vnode has unintended
consequences if someone has it open (as in, you won't be able to do
anything but close it).  Perhaps a recycle is better; I'll have to
figure that one out.

Could you point me to the section of the code in the Linux client that
does the inode invalidation upon lock cancellation?  I want to make sure
that I follow it and I'm doing the right thing.

--Ken

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Lustre-devel] Attribute caching and OBD_CONNECT_ATTRFID
  2010-06-21 19:17       ` Ken Hornstein
@ 2010-06-21 21:35         ` Andreas Dilger
  2010-06-30 18:23           ` Ken Hornstein
  0 siblings, 1 reply; 8+ messages in thread
From: Andreas Dilger @ 2010-06-21 21:35 UTC (permalink / raw)
  To: lustre-devel

On 2010-06-21, at 13:17, Ken Hornstein wrote:
>> This is in fact the right way to do it. If the lock is cancelledq then  
>> the client has no way of knowing whether that name will continue to  
>> point at the same FID or not.
> 
> I guess I was thinking in terms of inodebits locks, you may (for example)
> get the file size invalidated, but the name/FID is still the same.  Is
> my understanding of inodebits locks wrong?  It's entirely possible that
> it is.

That is correct.  It is possible to grant multiple ibits on the same lock, or it is possible to grant separate ibits on different locks.  Currently, we don't have DLM lock conversion, to allow e.g. cancelling only one bit from a granted lock.  That means if one lock bit is conflicting the whole lock (i.e. all its bits) will be cancelled.  The client will re-request only the ibits that it needs to avoid future contention.

>> Also note we recently changed the Linux client to invalidate the inode  
>> when all of the DLM locks related to that inode are cancelled.  This  
>> ensures the client 'forgets' about deleted inodes even when another  
>> client deleted it. Otherwise, the client will accumulate a lot of  
>> unnecessary inodes in cache.
> 
> I guess I could simply revoke the vnode when all of the DLM locks for
> it are cancelled (that tells the upper layer to forget everything you
> knew about the vnode).  That seems ... well, I guess I can't see
> anything wrong with it.  I guess now that you mention it, if another
> client deletes a file and you don't have a lock on it, you're not
> necessarily going to know that the delete happened, will you?
> But what happens if that file that's been invalidated is in use?

Currently, the Linux VFS will flag it as "DCACHE_INVALID" and it needs to be revalidated before use.

> Hm, actually, I see now that revoking the vnode has unintended
> consequences if someone has it open (as in, you won't be able to do
> anything but close it).  Perhaps a recycle is better; I'll have to
> figure that one out.
> 
> Could you point me to the section of the code in the Linux client that
> does the inode invalidation upon lock cancellation?  I want to make sure
> that I follow it and I'm doing the right thing.

The majority of this work was done in bug 20433.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Lustre-devel] Attribute caching and OBD_CONNECT_ATTRFID
  2010-06-21 21:35         ` Andreas Dilger
@ 2010-06-30 18:23           ` Ken Hornstein
  2010-06-30 18:27             ` Oleg Drokin
  0 siblings, 1 reply; 8+ messages in thread
From: Ken Hornstein @ 2010-06-30 18:23 UTC (permalink / raw)
  To: lustre-devel

Alright, so, I had some time to stare at this code some more ... a few
questions.

>Currently, the Linux VFS will flag it as "DCACHE_INVALID" and it needs
>to be revalidated before use.

This is DCACHE_LUSTRE_INVALID, right?  I couldn't find DCACHE_INVALID
anywhere.

>> Could you point me to the section of the code in the Linux client that
>> does the inode invalidation upon lock cancellation?  I want to make sure
>> that I follow it and I'm doing the right thing.
>
>The majority of this work was done in bug 20433.

I see the patch has been re-worked a bit in more recent Lustre.  As I
read it ... this just gets rid of the dentry for the inode, right?
That forces a new name lookup the next time the file is accesed by
name, right?  I guess that's how the lack of CONNECT_ATTRFID hasn't
been noticed.

As a side question ... I know it hasn't really been resolved (from what
I've read on this thread), but going forward should I assume that
CONNECT_ATTRFID will not be available?  It sounds like from what you
are saying that it will eventually not be available.

--Ken

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Lustre-devel] Attribute caching and OBD_CONNECT_ATTRFID
  2010-06-30 18:23           ` Ken Hornstein
@ 2010-06-30 18:27             ` Oleg Drokin
  0 siblings, 0 replies; 8+ messages in thread
From: Oleg Drokin @ 2010-06-30 18:27 UTC (permalink / raw)
  To: lustre-devel

Hello!

On Jun 30, 2010, at 2:23 PM, Ken Hornstein wrote:

> Alright, so, I had some time to stare at this code some more ... a few
> questions.
> 
>> Currently, the Linux VFS will flag it as "DCACHE_INVALID" and it needs
>> to be revalidated before use.
> This is DCACHE_LUSTRE_INVALID, right?  I couldn't find DCACHE_INVALID
> anywhere.

Yes.

>>> Could you point me to the section of the code in the Linux client that
>>> does the inode invalidation upon lock cancellation?  I want to make sure
>>> that I follow it and I'm doing the right thing.
>> The majority of this work was done in bug 20433.
> I see the patch has been re-worked a bit in more recent Lustre.  As I
> read it ... this just gets rid of the dentry for the inode, right?
> That forces a new name lookup the next time the file is accesed by
> name, right?  I guess that's how the lack of CONNECT_ATTRFID hasn't
> been noticed.

Right. The only way to encounter attr_by_fid is to execute something like fstat
on an already opened file descriptor.
Also there used to be NFS implications, I think.

> As a side question ... I know it hasn't really been resolved (from what
> I've read on this thread), but going forward should I assume that
> CONNECT_ATTRFID will not be available?  It sounds like from what you
> are saying that it will eventually not be available.

Actually it should be available going forward, at least I hope so.
This is a pretty important feature to ensure that open-unlinked files
work correctly.

Bye,
    Oleg

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2010-06-30 18:27 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-06-09 16:42 [Lustre-devel] Attribute caching and OBD_CONNECT_ATTRFID Ken Hornstein
2010-06-18 21:35 ` Andreas Dilger
2010-06-21  2:42   ` Ken Hornstein
2010-06-21 14:34     ` Andreas Dilger
2010-06-21 19:17       ` Ken Hornstein
2010-06-21 21:35         ` Andreas Dilger
2010-06-30 18:23           ` Ken Hornstein
2010-06-30 18:27             ` Oleg Drokin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.