Re: robinhood, fanotify name info events and lustre changelog

From: Amir Goldstein <amir73il@gmail.com>
To: "Quentin.BOUGET@cea.fr" <Quentin.BOUGET@cea.fr>
Cc: Dominique Martinet <asmadeus@codewreck.org>,
	Jan Kara <jack@suse.cz>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"linux-api@vger.kernel.org" <linux-api@vger.kernel.org>,
	"robinhood-devel@lists.sf.net" <robinhood-devel@lists.sf.net>
Subject: Re: robinhood, fanotify name info events and lustre changelog
Date: Sat, 30 May 2020 16:07:45 +0300
Message-ID: <CAOQ4uxgpugScXRLT6jJAAZf_ET+DpmEWoqkSdqCAMEwp+Kezhw@mail.gmail.com>
In-Reply-To: <1590777699518.49838@cea.fr>

On Fri, May 29, 2020 at 9:41 PM Quentin.BOUGET@cea.fr
<Quentin.BOUGET@cea.fr> wrote:
>
> Hi,
>
> Developer of robinhood v4 here,
>
> > > > [1] https://github.com/cea-hpc/robinhood/
>
> The sources for version 4 live in a separate branch:
> https://github.com/cea-hpc/robinhood/tree/v4
>
> Any feedback is welcome =)
>
> I am guessing the most interesting bits for this discussion should be found
> here:
> https://github.com/cea-hpc/robinhood/blob/v4/include/robinhood/fsevent.h
>

That is a very well documented API and a valuable resource for me.

Notes on API choices that are aligned with current fanotify plans:
- The combination of parent fid + object fid without name is never expected

Notes on API choices that are NOT aligned with current fanotify plans:
- LINK/UNLINK events carry the linked/unlinked object fid
- XATTR events for inode (not namespace) do not carry parent fid/name

This doesn't mean that fanotify -> rbh_fsevent translation is not going to
be possible.

With a fanotify FAN_CREATE event, for example, the parent fid + name
information should be used by the rbh adapter code to call
name_to_handle_at(2) and get the created object's file handle.
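
A minimal sketch of that lookup, written in Python through ctypes because
the standard library does not wrap name_to_handle_at(2). The helper name
handle_for_created_entry, the fixed-size FileHandle layout and the choice
to return None when the entry is already gone are assumptions made for
illustration; this is not robinhood or fanotify code. It assumes Linux and
that the adapter already holds an open fd for the parent directory.

```python
import ctypes
import errno
import os

MAX_HANDLE_SZ = 128  # upper bound on the opaque handle size, as in <fcntl.h>


class FileHandle(ctypes.Structure):
    """Mirrors struct file_handle, with a fixed-size buffer for f_handle."""
    _fields_ = [
        ("handle_bytes", ctypes.c_uint),
        ("handle_type", ctypes.c_int),
        ("f_handle", ctypes.c_ubyte * MAX_HANDLE_SZ),
    ]


_libc = ctypes.CDLL(None, use_errno=True)
_libc.name_to_handle_at.argtypes = [
    ctypes.c_int,                 # dirfd
    ctypes.c_char_p,              # pathname, relative to dirfd
    ctypes.POINTER(FileHandle),   # out: file handle
    ctypes.POINTER(ctypes.c_int), # out: mount id
    ctypes.c_int,                 # flags
]
_libc.name_to_handle_at.restype = ctypes.c_int


def handle_for_created_entry(parent_fd, name):
    """Return (handle_type, handle_bytes) for parent_fd/name, or None if gone."""
    fh = FileHandle()
    fh.handle_bytes = MAX_HANDLE_SZ
    mount_id = ctypes.c_int()
    ret = _libc.name_to_handle_at(
        parent_fd, os.fsencode(name), ctypes.byref(fh), ctypes.byref(mount_id), 0
    )
    if ret != 0:
        err = ctypes.get_errno()
        if err == errno.ENOENT:
            # The name no longer exists: the event is already stale.
            return None
        raise OSError(err, os.strerror(err), name)
    return fh.handle_type, bytes(fh.f_handle[:fh.handle_bytes])
```

Note that some filesystems do not support file handles at all, in which
case the call fails with EOPNOTSUPP and the sketch raises OSError.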

We made this API choice because fanotify events should not be
perceived as a sequence of changes that can be followed to describe
the current state of the filesystem.
fanotify events should be perceived as a "poll" on the namespace.
Whenever notified of a change, the application should read the current
state of the filesystem. fanotify events provide "just enough" information,
so that reading the current state of the filesystem is not too expensive.
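
To illustrate the "poll" model, here is a rough sketch of what an
application might do on any event that names parent fid + name: ignore
what the event claims happened and simply re-read the entry. The function
name read_current_state and the shape of the returned dict are
hypothetical; os.stat() with dir_fd is Python's interface to fstatat(2).

```python
import os
import stat


def read_current_state(parent_fd, name):
    """Return a small dict describing the entry as it exists now, or None."""
    try:
        st = os.stat(name, dir_fd=parent_fd, follow_symlinks=False)
    except FileNotFoundError:
        # The entry was removed or renamed after the event was generated.
        return None
    return {
        "ino": st.st_ino,
        "size": st.st_size,
        "mtime_ns": st.st_mtime_ns,
        "is_dir": stat.S_ISDIR(st.st_mode),
    }
```

A None result is not an error here: it only means a later event (delete
or rename) is expected to describe where the entry went.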

> I am not sure it will matter for the rest of the conversation, but just in case:
>
>     RobinHood v4 has a notion of a "namespace" xattr (like an xattr, but for
>     a dentry rather than an inode), it is used to store things that are only
>     really tied to the namespace (like the path of an entry). I don't think this
>     is really relevant here, you can probably ignore it.
>
>     Also, RobinHood uses file handles to uniquely identify filesystem entries,
>     and this is what is stored in a `struct rbh_id`.
>

Makes sense.
When a fanotify FAN_MODIFY event reports a change of file size along
with a parent fid + name that do not match the parent/name robinhood
knows about (i.e. because the event is received out of order with a rename),
you may use that information to create an rbh_fsevent_ns_xattr event to
update the path, or you may wait for the FAN_MOVE_SELF event that
should arrive later.
Up to you.
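
The two options could look roughly like this in adapter code. Everything
here is hypothetical: the in-memory "known" map, the reconcile_modify
name and the tuple records standing in for rbh_fsevent structures. It
also assumes the adapter has already resolved the object's id (for
instance through name_to_handle_at(2)).

```python
def reconcile_modify(known, obj_id, parent_id, name, size, update_path_now=True):
    """Return the records to emit for a FAN_MODIFY carrying parent fid + name.

    known maps obj_id -> (parent_id, name) as last recorded by the adapter.
    """
    records = [("size", obj_id, size)]
    if known.get(obj_id) != (parent_id, name):
        if update_path_now:
            # Option 1: trust the event and refresh the namespace record now.
            known[obj_id] = (parent_id, name)
            records.append(("ns_xattr", obj_id, (parent_id, name)))
        # Option 2: do nothing here and let FAN_MOVE_SELF fix the name later.
    return records
```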

> > > I couldn't find the documentation for Lustre Changelog format, because
> > > the name of the feature is not very Google friendly.
>
> Yes, this is really unfortunate. For the record, user documentation for Lustre
> lives at: http://doc.lustre.org/lustre_manual.xhtml
>
> Chapter 12.1 deals with "Lustre Changelogs" (not much more there than
> what Dominique already wrote).
>

Thanks for the link.

> > > There is one critical difference between a changelog and fanotify events.
> > > fanotify events are delivered asynchronously and may be delivered out
> > > of order, so application must not rely on path information to update
> > > internal records without using fstatat(2) to check the actual state of the
> > > object in the filesystem.
>
> > lustre changelogs are asynchronous but the order is guaranteed so we
> > might rely on that for robinhood v4,
>
> Yes, we do. At least to a certain extent: we at least expect changelog records
> for a single filesystem entry to be emitted in the order they happened on the
> FS. I have not really given much thought to how things would work in general
> if that wasn't true, but I know this is an issue for things that deal with the
> namespace: https://jira.whamcloud.com/browse/LU-12574
>

Since renaming directories outside of their parent is done in the kernel
under a filesystem-wide lock, I think we may consider providing, in the
future, a new event FAN_DIR_MOVE which is not merged and not reordered
with other FAN_DIR_MOVE events. We may even be able to go one step
further and say that FAN_DIR_MOVE is a barrier with which no event
inside the moved dir can be reordered, but at the moment, there is no
such guarantee for any type of event.

> > but full path is not computed from
> > information in the changelogs. Instead the design plan is to have a
> > process scrub the database for files that got updated since the last
> > path update and fix paths with fstatat, so I think it might work; but
> > that unfortunately hasn't been implemented yet.
>
> Not exactly (I am not sure it really matters, so I'll try to be brief).
>
> The idea to keep paths in sync with what's in the filesystem is to "tag"
> entries as we update their name (ie. after a rename). Then a separate
> process comes in, queries for entries that have that "tag", and updates
> their path by concatenating their parent's path (if the parents themselves
> are not "tagged") with the entries' own, up-to-date name. After that, if
> the entry was a directory, its children are "tagged". I simplified a bit, but
> that's the idea.
>

Nice. Thanks for explaining that.
I suppose you need to store the calculated path attribute for things like
index queries on the database?
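
If I read the tagging scheme right, it could be sketched as follows.
This is only my reading of the description above, not robinhood code:
the dict-based entry table, the field names and propagate_paths are all
made up for illustration.

```python
def propagate_paths(entries):
    """Recompute 'path' for tagged entries until no resolvable tag remains.

    entries maps id -> {"parent", "name", "path", "is_dir", "tagged"}.
    """
    progress = True
    while progress:
        progress = False
        for eid, entry in entries.items():
            if not entry["tagged"]:
                continue
            parent = entries.get(entry["parent"])
            if parent is not None and parent["tagged"]:
                continue  # parent path is stale; retry on a later pass
            base = parent["path"] if parent is not None else ""
            entry["path"] = base.rstrip("/") + "/" + entry["name"]
            entry["tagged"] = False
            progress = True
            if entry["is_dir"]:
                # Children inherit the new prefix, so they must be redone too.
                for child in entries.values():
                    if child["parent"] == eid:
                        child["tagged"] = True
```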

> So, to be fair, full paths _are_ computed solely from information in the
> changelog records, even though it requires a bit of processing on the side.
> No additional query to the filesystem for that.
>

As I wrote, the fact that robinhood trusts the information in changelog
records doesn't mean that information needs to arrive from the kernel.
The adapter code should use the information provided by fanotify events,
then use open_by_handle_at(2) on the directory fid to find its current
path in the filesystem, then feed that information into a robinhood change
record.
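
A rough sketch of the last two steps, starting from an already-open
directory fd. The open_by_handle_at(2) call itself is not shown, since it
requires CAP_DAC_READ_SEARCH. The helper names and the record layout are
hypothetical, and the sketch assumes /proc is mounted.

```python
import os


def current_path_of(dir_fd):
    """Return the path the kernel currently associates with an open fd.

    If the directory has been removed, the result carries a " (deleted)"
    suffix, which the caller would need to handle.
    """
    return os.readlink(f"/proc/self/fd/{dir_fd}")


def change_record_for(dir_fd, name):
    """Build a minimal change record for an entry inside the directory."""
    parent_path = current_path_of(dir_fd)
    return {
        "parent_path": parent_path,
        "name": name,
        "path": os.path.join(parent_path, name),
    }
```

Because the fd refers to the directory itself rather than to a name, the
resolved path follows the directory across renames.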

I would be happy to work with you on a POC for adapting fanotify
test code to robinhood v4, but before I invest time in that, I would
need to know there is a good chance that people are going to test and
use robinhood with the Linux vfs.

May I ask, what is the reason for embarking on the project to decouple
the robinhood v4 API from the Lustre changelog API?
Is it because you had other fsevent producers in mind?
Do you have actual users requesting to use robinhood with non-Lustre fs?

Thanks,
Amir.