* [GIT PULL] Fanotify revert for 5.7-rc8
@ 2020-05-27 17:21 Jan Kara
  2020-05-27 18:10 ` pr-tracker-bot
       [not found] ` <20200527173937.GA17769@nautica>
  0 siblings, 2 replies; 10+ messages in thread
From: Jan Kara @ 2020-05-27 17:21 UTC
  To: Linus Torvalds; +Cc: linux-fsdevel, Amir Goldstein

  Hello Linus,

  could you please pull from

git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git fsnotify_for_v5.7-rc8

to get a patch that disables FAN_DIR_MODIFY support that was merged in
5.7-rc1. When discussing further functionality we realized it may be more
logical to guard it with a feature flag or to call things slightly
differently (or maybe not) so let's not set the API in stone for now.

Top of the tree is f17936993af0. The full shortlog is:

Amir Goldstein (1):
      fanotify: turn off support for FAN_DIR_MODIFY

The diffstat is

 fs/notify/fanotify/fanotify.c | 2 +-
 include/linux/fanotify.h      | 3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

							Thanks
								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [GIT PULL] Fanotify revert for 5.7-rc8
  2020-05-27 17:21 [GIT PULL] Fanotify revert for 5.7-rc8 Jan Kara
@ 2020-05-27 18:10 ` pr-tracker-bot
       [not found] ` <20200527173937.GA17769@nautica>
  1 sibling, 0 replies; 10+ messages in thread
From: pr-tracker-bot @ 2020-05-27 18:10 UTC
  To: Jan Kara; +Cc: Linus Torvalds, linux-fsdevel, Amir Goldstein

The pull request you sent on Wed, 27 May 2020 19:21:43 +0200:

> git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git fsnotify_for_v5.7-rc8

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/b0c3ba31be3e45a130e13b278cf3b90f69bda6f6

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker


* Re: robinhood, fanotify name info events and lustre changelog
       [not found]   ` <CAOQ4uxjQXwTo1Ug4jY1X+eBdLj80rEfJ0X3zKRi+L8L_uYSrgQ@mail.gmail.com>
@ 2020-05-28 12:56     ` Dominique Martinet
  2020-05-29 18:41       ` Quentin.BOUGET
  0 siblings, 1 reply; 10+ messages in thread
From: Dominique Martinet @ 2020-05-28 12:56 UTC
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, linux-api, robinhood-devel

Amir Goldstein wrote on Thu, May 28, 2020:
> Since you started this thread privately, I am replying privately,
> but if you don't mind, please respond with CC to linux-fsdevel, linux-api
> and also lustre lists if you like, so other developers may participate in
> the discussion.

No problem going public; added linux-fsdevel, linux-api as suggested and
robinhood-devel for the robinhood side.
It might be interesting to retrofit lustre changelogs into the fanotify
API at some point but I don't see it likely to happen, so let's start
small for now :)

> > (Probably the same as most filesystem indexers would want, I would use
> > it for robinhood[1] - it normally consumes lustre changelogs and not
> > local vfs events so $job doesn't really care, but I would use a fanotify
> > mode for home once it becomes useable because why not :D)
> >
> > [1] https://github.com/cea-hpc/robinhood/
> 
> This looks very interesting.
> So, do you intend to integrate fanotify with robinhood as a hobby project?
> I wonder, I did not find much evidence of robinhood being used outside of
> the Lustre community and without the Lustre Changelog.
> At least since 2008, I see no public discussions on devel lists
> and the only changes seem related to the Lustre mode.

I know robinhood has also been used to purge NFS home directories
("temp" directories that are less restricted than homes in volume but
get purged after x days), both at CEA and in other companies who reached
out to me privately so I cannot name them.
FWIW on this side netapp filers support something similar to changelogs
in the form of audit logs, which we never bothered implementing but
which would probably be accepted if someone did -- but with linux knfsd
maybe fanotify on the server side might work? I haven't tried.

As far as I know that means people using robinhood on NFS just run full
filesystem scans every day/week/x.



That being said, it is also true that robinhood has very few users
outside of the lustre community; I use it for manual file scrubbing
(verifying checksums on a semi-regular basis) at home. As you pointed
out, that really is in the realm of hobby project even if that helped
find a few bugs.


> I am asking because this project looks like it could be interesting for $job.
> I was looking for a "champion app" to demonstrate new fanotify features.
> I chose inotify-tools for the demo, because it was the easiest to adapt,
> but was going to start a more serious look into Watchman.
> Watchman seems to be in heavy use in Facebook and actively maintained.
> Its starting point is inotify (+fs scanner of course), so I expect it
> would be an easier fit than starting from Lustre Changelog. Or not?

robinhood (current master branch) is quite heavily tied to lustre. I
think Cray had started porting the code to use VFS file handles instead
of lustre FIDs to make it easier to use but that never quite finished.
OTOH, robinhood v4 is not tied to lustre, but is still a work in
progress. Quentin in Cc has some proof of concept at ingesting
changelogs.
It has been designed with me in mind so should be much easier to
integrate in there (the lustre portion just converts changelogs to a
robinhood-specific 'fsevents' format which is then injected, so there
would be just that fanotify->fsevents conversion to do), but it is still
very young and doesn't have all the features of v3 so might be less
adapted for a champion project.

I'm not sure what to advise on there, from what I'm reading of watchman
it would probably be easier to integrate with than robinhood v3 for
sure, so if you want code to go into a currently-running version it
might be easier to go with that.
(if you do want to do the work for robinhood v3 though I think we would
be happy to integrate the change even with v4 underway, but I am not
responsible for that so cannot make promises; we'd probably be happier
with v4 as a target in the long run)


> I couldn't find the documentation for Lustre Changelog format, because
> the name of the feature is not very Google friendly.
> But looking at the robinhood source code, the direction we are going
> with fanotify seems to be consistent with the designs of Lustre Changelog.
> 
> I am including some snippets from robinhood  chglog_reader.c
> that Jan may find interesting:
> 
> #define PFID(_pid) (_pid)->fs_key, (_pid)->inode
> #define CL_NAME_ARG(_rec_) PFID(&(_rec_)->cr_pfid), (_rec_)->cr_namelen, \
>         rh_get_cl_cr_name(_rec_)
> #define CL_EXT_FORMAT   "s="DFID" sp="DFID" %.*s"
> #elif defined(HAVE_CHANGELOG_EXTEND_REC)
>         if (fid_is_sane(&rec->cr_sfid)) {
>             len = snprintf(curr, left, " " CL_EXT_FORMAT,
>                            PFID(&rec->cr_sfid),
>                            PFID(&rec->cr_spfid),
>                            changelog_rec_snamelen((CL_REC_TYPE *)rec),
>                            changelog_rec_sname((CL_REC_TYPE *)rec));
> 
>     /* parent id is always set when name is (Cf. comment in lfs.c) */
> 
>             /* Ensure compatibility with older Lustre versions:
>              * push RNMFRM to remove the old path from NAMES table.
>              * push RNMTO to add target path information.
>              */
> 
> It looks like the Lustre change record "extended" format is on par with
> the information that the fanotify name info events that patches v3 [1]
> are providing for events "on child" (e.g FAN_MODIFY).

Here are a few examples of (logs of) changelogs so you get an idea, but it
looks like you understood this correctly (filenames and jobnames
redacted for privacy; we don't actually use the jobnames for robinhood
itself):
2020/05/28 03:49:26 [383/2] fsname-MDT0001: 62545421787 13TRUNC 1590630534.494881281 0xe t=[0xcc005c7aa:0x12cf7:0x0] J=jobname
2020/05/28 03:49:26 [383/2] fsname-MDT0001: 62545421788 11CLOSE 1590630534.495782850 0x43 t=[0xcc005c7aa:0x12cf7:0x0] J=jobname
2020/05/28 03:49:26 [383/2] fsname-MDT0001: 62545422212 01CREAT 1590630535.038294162 0x0 t=[0xcc0056422:0x1e071:0x0] p=[0xcc0056422:0x1e005:0x0] filename J=jobname
2020/05/28 03:49:26 [383/2] fsname-MDT0001: 62545448338 08RENME 1590630550.659753428 0x0 t=[0:0x0:0x0] p=[0xcc005f145:0x12da:0x0] filename_from s=[0xcc00600d7:0xe0:0x0] sp=[0xcc005f145:0x12da:0x0] filename_to J=jobname
2020/05/28 03:49:26 [383/2] fsname-MDT0001: 62545449617 06UNLNK 1590630551.756078437 0x1 t=[0xcc005ff7f:0x42bd:0x0] p=[0xcc0057c27:0x1dc1d:0x0] filename J=jobname
2020/05/28 03:49:58 [383/2] fsname-MDT0001: 62545494822 14SATTR 1590630574.376208143 0x14 t=[0xcc005f9a4:0xa9a6:0x0] J=jobname
2020/05/28 03:51:02 [383/2] fsname-MDT0001: 62545616687 02MKDIR 1590630648.489224641 0x0 t=[0xcc0045fd0:0x8e4b:0x0] p=[0xcc0036d90:0x1:0x0] 

So you always have object fid being acted on, and (parent fid + name
component) for source and destination if they matter (e.g. setattr won't
have any name, but create will have one, and rename both)
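
To make that layout concrete, here is a minimal C sketch that pulls one
fid out of a logged line like the ones above. The struct fid and
parse_fid() names are made up for illustration (this is neither robinhood
nor lustre code), and it assumes the [seq:oid:ver] text form shown in
these logs and that file names do not themselves contain " p=[" and
friends.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* A fid as printed in the logs above: [sequence:object id:version]. */
struct fid {
    uint64_t seq;
    uint32_t oid;
    uint32_t ver;
};

/*
 * Look for " <key>=[seq:oid:ver]" in a logged changelog line.
 * key is "t" (target), "p" (parent), "s" (source) or "sp" (source parent).
 * Returns 0 on success, -1 if the field is absent or malformed.
 */
static int parse_fid(const char *line, const char *key, struct fid *out)
{
    char pattern[8];
    const char *p;

    /* The leading space keeps "p" from matching inside "sp". */
    if (snprintf(pattern, sizeof(pattern), " %s=[", key) >= (int)sizeof(pattern))
        return -1;

    p = strstr(line, pattern);
    if (p == NULL)
        return -1;
    p += strlen(pattern);

    /* %x conversions accept the optional "0x" prefix used in the logs. */
    if (sscanf(p, "%" SCNx64 ":%" SCNx32 ":%" SCNx32,
               &out->seq, &out->oid, &out->ver) != 3)
        return -1;
    return 0;
}
```

With the RENME line above, parse_fid(line, "s", &f) yields the source fid
and parse_fid(line, "p", &f) the parent fid, while on the SATTR line a
lookup for "p" fails because that record carries no parent.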


> It is not clear to me if Lustre Changelog provides the "extended"
> record for create/rename/delete, but we were not planning to do that.

Ok.

> There is one critical difference between a changelog and fanotify events.
> fanotify events are delivered asynchronously and may be delivered out
> of order, so application must not rely on path information to update
> internal records without using fstatat(2) to check the actual state of the
> object in the filesystem.

lustre changelogs are asynchronous but the order is guaranteed so we
might rely on that for robinhood v4, but full path is not computed from
information in the changelogs. Instead the design plan is to have a
process scrub the database for files that got updated since the last
path update and fix paths with fstatat, so I think it might work ; but
that unfortunately hasn't been implemented yet.
(so db update would be done in multiple steps; but it should also be
possible to supplement information in the pipeline, because lustre
changelogs don't have size etc., which are in the db, and that might also
be able to take care of path updates; I guess both models should work
for fanotify since the stat itself is synchronous and you can get path
from /proc/self/fd/x on local filesystems (it doesn't work on lustre;
there's a fid2path helper though))

robinhood v3 systematically does a stat and recomputes path from fid.

> For that reason, we defined the FAN_DIR_MODIFY event, which carries
> info of parent fid and name that can be used for fstatat(2).
> As of yesterday, FAN_DIR_MODIFY is disabled in master, so will not be
> available in v5.7. We are planning to re-enable it in the future with an
> appropriate fanotify_init(2) flag for reporting file names.

Yes that started this thread :)
I'm happy to run tests with a custom branch if you need to; we run rhel
kernels normally so would need to recompile anyway.

Thanks!
-- 
Dominique


* Re: robinhood, fanotify name info events and lustre changelog
  2020-05-28 12:56     ` robinhood, fanotify name info events and lustre changelog Dominique Martinet
@ 2020-05-29 18:41       ` Quentin.BOUGET
  2020-05-30 13:07         ` Amir Goldstein
  0 siblings, 1 reply; 10+ messages in thread
From: Quentin.BOUGET @ 2020-05-29 18:41 UTC
  To: Dominique Martinet, Amir Goldstein
  Cc: Jan Kara, linux-fsdevel, linux-api, robinhood-devel

Hi,

Developer of robinhood v4 here,

> > > [1] https://github.com/cea-hpc/robinhood/

The sources for version 4 live in a separate branch:
https://github.com/cea-hpc/robinhood/tree/v4

Any feedback is welcome =)

I am guessing the most interesting bits for this discussion should be found
here:
https://github.com/cea-hpc/robinhood/blob/v4/include/robinhood/fsevent.h

I am not sure it will matter for the rest of the conversation, but just in case:

    RobinHood v4 has a notion of a "namespace" xattr (like an xattr, but for
    a dentry rather than an inode), it is used to store things that are only
    really tied to the namespace (like the path of an entry). I don't think this
    is really relevant here, you can probably ignore it.

    Also, RobinHood uses file handles to uniquely identify filesystem entries,
    and this is what is stored in a `struct rbh_id`.

> > I couldn't find the documentation for Lustre Changelog format, because
> > the name of the feature is not very Google friendly.

Yes, this is really unfortunate. For the record, user documentation for Lustre
lives at: http://doc.lustre.org/lustre_manual.xhtml

Chapter 12.1 deals with "Lustre Changelogs" (not much more there than
what Dominique already wrote).

> > There is one critical difference between a changelog and fanotify events.
> > fanotify events are delivered asynchronously and may be delivered out
> > of order, so application must not rely on path information to update
> > internal records without using fstatat(2) to check the actual state of the
> > object in the filesystem.

> lustre changelogs are asynchronous but the order is guaranteed so we
> might rely on that for robinhood v4,

Yes, we do. At least to a certain extent : we at least expect changelog records
for a single filesystem entry to be emitted in the order they happened on the
FS. I have not really given much thought to how things would work in general
if that wasn't true, but I know this is an issue for things that deal with the
namespace : https://jira.whamcloud.com/browse/LU-12574

> but full path is not computed from
> information in the changelogs. Instead the design plan is to have a
> process scrub the database for files that got updated since the last
> path update and fix paths with fstatat, so I think it might work ; but
> that unfortunately hasn't been implemented yet.

Not exactly (I am not sure it really matters, so I'll try to be brief).

The idea to keep paths in sync with what's in the filesystem is to "tag"
entries as we update their name (i.e. after a rename). Then a separate
process comes in, queries for entries that have that "tag", and updates
their path by concatenating their parent's path (if the parents themselves
are not "tagged") with the entries' own, up-to-date name. After that, if
the entry was a directory, its children are "tagged". I simplified a bit, but
that's the idea.

So, to be fair, full paths _are_ computed solely from information in the
changelog records, even though it requires a bit of processing on the side.
No additional query to the filesystem for that.
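
For what it's worth, here is a rough in-memory sketch of that tagging
scheme in C. The struct entry layout and fix_tagged_paths() are
hypothetical and stand in for database queries; this is only the idea,
not the v4 code.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PATH_LEN 256

struct entry {
    int parent;           /* index of the parent entry, -1 for the root */
    const char *name;     /* current, up-to-date name */
    char path[PATH_LEN];  /* stored path, possibly stale */
    bool is_dir;
    bool tagged;          /* path needs to be recomputed */
};

/*
 * One pass over the tagged entries. An entry is only fixed once its
 * parent is no longer tagged, i.e. once the parent's path can be trusted.
 * Returns the number of paths that were recomputed.
 */
static int fix_tagged_paths(struct entry *e, int n)
{
    int fixed = 0;

    for (int i = 0; i < n; i++) {
        if (!e[i].tagged || e[i].parent < 0 || e[e[i].parent].tagged)
            continue;

        /* New path = parent's path + own up-to-date name. */
        snprintf(e[i].path, PATH_LEN, "%s/%s",
                 e[e[i].parent].path, e[i].name);
        e[i].tagged = false;
        fixed++;

        /* The paths of a directory's children are now stale too. */
        if (e[i].is_dir)
            for (int j = 0; j < n; j++)
                if (e[j].parent == i)
                    e[j].tagged = true;
    }
    return fixed;
}
```

The separate process would then simply repeat the pass until nothing is
left to fix, e.g. while (fix_tagged_paths(entries, n) > 0);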

Cheers,
Quentin


* Re: robinhood, fanotify name info events and lustre changelog
  2020-05-29 18:41       ` Quentin.BOUGET
@ 2020-05-30 13:07         ` Amir Goldstein
  2020-05-30 13:39           ` Dominique Martinet
  0 siblings, 1 reply; 10+ messages in thread
From: Amir Goldstein @ 2020-05-30 13:07 UTC
  To: Quentin.BOUGET
  Cc: Dominique Martinet, Jan Kara, linux-fsdevel, linux-api, robinhood-devel

On Fri, May 29, 2020 at 9:41 PM Quentin.BOUGET@cea.fr
<Quentin.BOUGET@cea.fr> wrote:
>
> Hi,
>
> Developer of robinhood v4 here,
>
> > > > [1] https://github.com/cea-hpc/robinhood/
>
> The sources for version 4 live in a separate branch:
> https://github.com/cea-hpc/robinhood/tree/v4
>
> Any feedback is welcome =)
>
> I am guessing the most interesting bits for this discussion should be found
> here:
> https://github.com/cea-hpc/robinhood/blob/v4/include/robinhood/fsevent.h
>

That is a very well documented API and a valuable resource for me.

Notes for API choices that are aligned with current fanotify plans:
- The combination of parent fid + object fid without name is never expected

Notes for API choices that are NOT aligned with current fanotify plans:
- LINK/UNLINK events carry the linked/unlinked object fid
- XATTR events for inode (not namespace) do not carry parent fid/name

This doesn't mean that fanotify -> rbh_fsevent translation is not going to
be possible.

With fanotify FAN_CREATE event, for example, the parent fid + name
information should be used by the rbh adapter code to call
name_to_handle_at(2) and get the created object's file handle.

The reason we made this API choice is because fanotify events should
not be perceived as a sequence of changes that can be followed to
describe the current state of the filesystem.
fanotify events should be perceived as a "poll" on the namespace.
Whenever notified of a change, application should read the current state
for the filesystem. fanotify events provide "just enough" information, so
reading the current state of the filesystem is not too expensive.

> I am not sure it will matter for the rest of the conversation, but just in case:
>
>     RobinHood v4 has a notion of a "namespace" xattr (like an xattr, but for
>     a dentry rather than an inode), it is used to store things that are only
>     really tied to the namespace (like the path of an entry). I don't think this
>     is really relevant here, you can probably ignore it.
>
>     Also, RobinHood uses file handles to uniquely identify filesystem entries,
>     and this is what is stored in a `struct rbh_id`.
>

Makes sense.
When fanotify event FAN_MODIFY reports a change of file size,
along with parent fid + name, that do not match the parent/name robinhood
knows about (i.e. because the event is received out of order with rename),
you may use that information to create rbh_fsevent_ns_xattr event to
update the path or you may wait for the FAN_MOVE_SELF event that
should arrive later.
Up to you.

> > > I couldn't find the documentation for Lustre Changelog format, because
> > > the name of the feature is not very Google friendly.
>
> Yes, this is really unfortunate. For the record, user documentation for Lustre
> lives at: http://doc.lustre.org/lustre_manual.xhtml
>
> Chapter 12.1 deals with "Lustre Changelogs" (not much more there than
> what Dominique already wrote).
>

Thanks for the link.

> > > There is one critical difference between a changelog and fanotify events.
> > > fanotify events are delivered asynchronously and may be delivered out
> > > of order, so application must not rely on path information to update
> > > internal records without using fstatat(2) to check the actual state of the
> > > object in the filesystem.
>
> > lustre changelogs are asynchronous but the order is guaranteed so we
> > might rely on that for robinhood v4,
>
> Yes, we do. At least to a certain extent : we at least expect changelog records
> for a single filesystem entry to be emitted in the order they happened on the
> FS. I have not really given much thought to how things would work in general
> if that wasn't true, but I know this is an issue for things that deal with the
> namespace : https://jira.whamcloud.com/browse/LU-12574
>

I think we may consider in the future, since renaming directories outside
of their parent is done in the kernel under a filesystem-wide lock, to
provide a new event FAN_DIR_MOVE which is not merged and not
reordered with other FAN_DIR_MOVE events. We may even be able to
go one step further and say that FAN_DIR_MOVE is a barrier with which
no event inside the moved dir can be reordered, but at the moment,
there is no such guarantee for any type of event.

> > but full path is not computed from
> > information in the changelogs. Instead the design plan is to have a
> > process scrub the database for files that got updated since the last
> > path update and fix paths with fstatat, so I think it might work ; but
> > that unfortunately hasn't been implemented yet.
>
> Not exactly (I am not sure it really matters, so I'll try to be brief).
>
> The idea to keep paths in sync with what's in the filesystem is to "tag"
> entries as we update their name (ie. after a rename). Then a separate
> process comes in, queries for entries that have that "tag", and updates
> their path by concatenating their parent's path (if the parents themselves
> are not "tagged") with the entries' own, up-to-date name. After that, if
> the entry was a directory, its children are "tagged". I simplified a bit, but
> that's the idea.
>

Nice. thanks for explaining that.
I suppose you need to store the calculated path attribute for things like
index queries on the database?

> So, to be fair, full paths _are_ computed solely from information in the
> changelog records, even though it requires a bit of processing on the side.
> No additional query to the filesystem for that.
>

As I wrote, the fact that robinhood trusts the information in changelog
records doesn't mean that information needs to arrive from the kernel.
The adapter code should use information provided by fanotify events,
then use open_by_handle_at(2) for the directory fid to find its current
path in the filesystem, then feed that information to a robinhood change
record.
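
A minimal sketch of the path lookup step, assuming a local filesystem
with /proc mounted. fd_to_path() is a hypothetical helper; in the adapter
the fd would come from open_by_handle_at(2) on the directory fid, which
requires CAP_DAC_READ_SEARCH, so a plain open(2) can stand in for it when
trying this unprivileged.

```c
#include <fcntl.h>   /* open(2), used to obtain an fd to try this on */
#include <stdio.h>
#include <unistd.h>

/*
 * Resolve the current path of an open fd through /proc/self/fd/<fd>.
 * Returns the path length, or -1 on error or if buf is too small.
 */
static ssize_t fd_to_path(int fd, char *buf, size_t size)
{
    char link[64];
    ssize_t len;

    if (size < 2)
        return -1;

    snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
    len = readlink(link, buf, size - 1);
    /* readlink() does not terminate and silently truncates. */
    if (len < 0 || (size_t)len >= size - 1)
        return -1;
    buf[len] = '\0';
    return len;
}
```

The path is only a snapshot: the directory may be renamed right after the
readlink, so the result is a hint to store, not something to rely on
without re-checking.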

I would be happy to work with you on a POC for adapting fanotify
test code with robinhood v4, but before I invest time on that, I would
need to know there is a good chance that people are going to test and
use robinhood with Linux vfs.

May I ask, what is the reason for embarking on the project to decouple
robinhood v4 API from Lustre changelog API?
Is it because you had other fsevent producers in mind?
Do you have actual users requesting to use robinhood with non-Lustre fs?

Thanks,
Amir.


* Re: robinhood, fanotify name info events and lustre changelog
  2020-05-30 13:07         ` Amir Goldstein
@ 2020-05-30 13:39           ` Dominique Martinet
  2020-05-30 20:37             ` Amir Goldstein
  0 siblings, 1 reply; 10+ messages in thread
From: Dominique Martinet @ 2020-05-30 13:39 UTC
  To: Amir Goldstein
  Cc: Quentin.BOUGET, Jan Kara, linux-fsdevel, linux-api, robinhood-devel

Answering what I can until Quentin chips back in.

Amir Goldstein wrote on Sat, May 30, 2020:
> Nice. thanks for explaining that.
> I suppose you need to store the calculated path attribute for things like
> index queries on the database?

Either querying for a subtree or simply printing the path (rbh-find
would print path by default, like find does)

> > So, to be fair, full paths _are_ computed solely from information in the
> > changelog records, even though it requires a bit of processing on the side.
> > No additional query to the filesystem for that.
> 
> As I wrote, the fact that robinhood trusts the information in changelog
> records doesn't mean that information needs to arrive from the kernel.
> The adapter code should use information provided by fanotify events,
> then use open_by_handle_at(2) for the directory fid to find its current
> path in the filesystem, then feed that information to a robinhood change
> record.

I can agree with that - just because for lustre we made the decision
to be able to run without querying the filesystem at all doesn't mean it
has to hold true for all types of input.

> I would be happy to work with you on a POC for adapting fanotify
> test code with robinhood v4, but before I invest time on that, I would
> need to know there is a good chance that people are going to test and
> use robinhood with Linux vfs.
>
> Do you have actual users requesting to use robinhood with non-Lustre
> fs?

I would run it at home, but that isn't much :D
As I wrote previously we have users for large nfs shares out of lustre,
but I honestly don't think there will be much use for local filesystems
at least in the short term.

Filesystem indexers like tracker[1] or similar would definitely get much
more use for that; from an objective point of view I wouldn't suggest
you spend time on robinhood for this: local filesystems are rarely large
enough to warrant using something like robinhood, and without something
like fanotify we wouldn't be efficient for a local disk with hundreds of
millions of files anyway because of the prohibitive rescan cost - so
it's a bit like chicken and egg maybe, I don't know, but if you want
many users to test different configurations I wouldn't recommend
robinhood (OTOH, we run CI tests so would be happy to add that to the
tests once it's available on vanilla kernels; but that's still not real
users)

[1] https://wiki.gnome.org/Projects/Tracker


> May I ask, what is the reason for embarking on the project to decouple
> robinhood v4 API from Lustre changelog API?
> Is it because you had other fsevent producers in mind?

I've been planning to add at least some recursive inotify watch on a
subfolder (like watchman does) since before these new fanotify events
came up, so I might be partly to blame for that.

There also are advantages for lustre though; the point is to be able to
ingest changelogs directly with some daemon (it's only at proof of
concept level for v4 at this point), but also to split the load by
involving multiple lustre clients.
So you would get a pool of lustre clients to read changelogs, a pool of
lustre clients to stat files as required to enrich the fsevents (file
size etc), and a pool of servers to read fsevents and commit changes to
the database (this part is still at the design level afaik)


Hope this all makes sense,
-- 
Dominique


* Re: robinhood, fanotify name info events and lustre changelog
  2020-05-30 13:39           ` Dominique Martinet
@ 2020-05-30 20:37             ` Amir Goldstein
  2020-06-01 19:46               ` Quentin.BOUGET
  0 siblings, 1 reply; 10+ messages in thread
From: Amir Goldstein @ 2020-05-30 20:37 UTC
  To: Dominique Martinet
  Cc: Quentin.BOUGET, Jan Kara, linux-fsdevel, linux-api, robinhood-devel

> > I would be happy to work with you on a POC for adapting fanotify
> > test code with robinhood v4, but before I invest time on that, I would
> > need to know there is a good chance that people are going to test and
> > use robinhood with Linux vfs.
> >
> > Do you have actual users requesting to use robinhood with non-Lustre
> > fs?
>
> I would run it at home, but that isn't much :D
> As I wrote previously we have users for large nfs shares out of lustre,
> but I honestly don't think there will be much use for local filesystems
> at least in the short term.
>
> Filesystem indexers like tracker[1] or similar would definitely get much
> more use for that; from an objective point of view I wouldn't suggest
> you spend time on robinhood for this: local filesystems are rarely large
> enough to warrant using something like robinhood, and without something
> like fanotify we wouldn't be efficient for a local disk with hundreds of
> millions of files anyway because of the prohibitive rescan cost - so
> it's a bit like chicken and egg maybe,

I very much agree with that statement.
I have written the kernel side to facilitate a file server system with many
millions of files. We use in-house software for the user side, not very
different in concept from robinhood.
So I am looking for a similar use case out there using open source
software, and they are not easy to find.

> I don't know, but if you want
> many users to test different configurations I wouldn't recommend
> robinhood (OTOH, we run CI tests so would be happy to add that to the
> tests once it's available on vanilla kernels; but that's still not real
> users)
>
> [1] https://wiki.gnome.org/Projects/Tracker
>

The problem with Tracker/Watchman is that they run as unprivileged
per-user services, and fanotify requires CAP_SYS_ADMIN (for good
reasons).
Also, if they are not used for watching a very large number of
directories, there is no strong incentive to switch from inotify to
fanotify.

My plan was to create a privileged system watchman service that
feeds off of fanotify and serves unprivileged watchman services.
This is not unlike MacOS fseventsd.
I never got around to assessing the size of that task.

Looking at robinhood (especially v4), it seems like it could fit
very well into the vacuum in Linux and act as "fsnotifyd".
Unprivileged applications and services could register to event streams
and get fed from the db, so applications that are not running will not
lose events.
Events delivered to unprivileged applications need to be filtered by
the subtree those applications watch, something that fanotify does not
do and is not likely to do, and by the access permissions of the
application to the path of the reported object.
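
The subtree half of that filtering could be as simple as the following
sketch. path_in_subtree() is a hypothetical helper, and it assumes both
paths are absolute and already normalized (no "..", no duplicate
slashes); the access permission check is a separate problem and is not
covered here.

```c
#include <stdbool.h>
#include <string.h>

/*
 * True if path is root itself or lies below it. A plain prefix compare
 * is not enough: "/home/alicex" must not match the root "/home/alice".
 */
static bool path_in_subtree(const char *path, const char *root)
{
    size_t n = strlen(root);

    /* Ignore trailing slashes so "/home/alice/" and "/" work as roots. */
    while (n > 0 && root[n - 1] == '/')
        n--;

    if (strncmp(path, root, n) != 0)
        return false;
    return path[n] == '\0' || path[n] == '/';
}
```

Because the stored path of an object can be stale, such a service would
have to apply this to the path as it is known at delivery time.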

This is not going to be an easy task, but without it, fanotify can serve
some niche use cases and not be as helpful to the wider community.

>
> > May I ask, what is the reason for embarking on the project to decouple
> > robinhood v4 API from Lustre changelog API?
> > Is it because you had other fsevent producers in mind?
>
> I've been planning to add at least some recursive inotify watch on a
> subfolder (like watchman does) since before these new fanotify events
> came up, so I might be partly to blame for that.
>
> There also are advantages for lustre though; the point is to be able to
> ingest changelogs directly with some daemon (it's only at proof of
> concept level for v4 at this point), but also to split the load by
> involving multiple lustre clients.
> So you would get a pool of lustre clients to read changelogs, a pool of
> lustre clients to stat files as required to enrich the fsevents (file
> size etc), and a pool of servers to read fsevents and commit changes to
> the database (this part is still at the design level afaik)
>

Sounds interesting. I hope I was able to plant enough seeds
in your mind to steer robinhood in the direction of a future "fsnotifyd" ;-)

Anyway, I will CC you with new posting of my work, so if you want
you can take it for a test drive.

Thanks,
Amir.


* Re: robinhood, fanotify name info events and lustre changelog
  2020-05-30 20:37             ` Amir Goldstein
@ 2020-06-01 19:46               ` Quentin.BOUGET
  2020-06-01 20:20                 ` Amir Goldstein
  0 siblings, 1 reply; 10+ messages in thread
From: Quentin.BOUGET @ 2020-06-01 19:46 UTC
  To: Amir Goldstein, Dominique Martinet
  Cc: Jan Kara, linux-fsdevel, linux-api, robinhood-devel

> > > > I am guessing the most interesting bits for this discussion should be found
> > > > here:
> > > > https://github.com/cea-hpc/robinhood/blob/v4/include/robinhood/fsevent.h
> > > >
> > > 
> > > That is a very well documented API and a valuable resource for me.

Thank you!

> > > Notes for API choices that are aligned with current fanotify plans:
> > > - The combination of parent fid + object fid without name is never expected
> > > 
> > > Notes for API choices that are NOT aligned with current fanotify plans:
> > > - LINK/UNLINK events carry the linked/unlinked object fid
> > > - XATTR events for inode (not namespace) do not carry parent fid/name
> > > 
> > > This doesn't mean that fanotify -> rbh_fsevent translation is not going to
> > > be possible.
> > > 
> > > With fanotify FAN_CREATE event, for example, the parent fid + name
> > > information should be used by the rbh adapter code to call
> > > name_to_handle_at(2) and get the created object's file handle.
> > > 
> > > The reason we made this API choice is because fanotify events should
> > > not be perceived as a sequence of changes that can be followed to
> > > describe the current state of the filesystem.
> > > fanotify events should be perceived as a "poll" on the namespace.
> > > Whenever notified of a change, application should read the current state
> > > for the filesystem. fanotify events provide "just enough" information, so
> > > reading the current state of the filesystem is not too expensive.

I am a little worried about objects that would move around constantly and thus
"evade" name_to_handle_at(). A bad actor could try to hide a setuid binary this
way... Of course they could also just copy/delete the file repeatedly and in
this case having the fid becomes useless, but it seems harder to do, and it is
likely it would take more time than a simple rename.

> > > When fanotify event FAN_MODIFY reports a change of file size,
> > > along with parent fid + name, that do not match the parent/name robinhood
> > > knows about (i.e. because the event is received out of order with rename),
> > > you may use that information to create rbh_fsevent_ns_xattr event to
> > > update the path or you may wait for the FAN_MOVE_SELF event that
> > > should arrive later.
> > > Up to you.

This is making me think: if I receive such a FAN_MODIFY event, and an object
is moved at parent_fid + name before I query the FS, how can I know which file
the event was originally meant for?

> > > > So, to be fair, full paths _are_ computed solely from information in the
> > > > changelog records, even though it requires a bit of processing on the side.
> > > > No additional query to the filesystem for that.
> > > 
> > > As I wrote, the fact that robinhood trusts the information in changelog
> > > records doesn't mean that information needs to arrive from the kernel.
> > > The adapter code should use information provided by fanotify events
> > > then use open_by_handle_at(2) for directory fid to find its current
> > > path in the filesystem then feed that information to a robinhood change
> > > record.
> > 
> > I can agree with that - it's not because for lustre we made the decision
> > to be able to run without querying the filesystem at all that it has to
> > hold true for all types of inputs.

I agree as well. The issue I mention above is a special case. In general, I am
fine with the "just enough information" approach.

> > > May I ask, what is the reason for embarking on the project to decouple
> > > robinhood v4 API from Lustre changelog API?

There is an impedance mismatch between what Lustre emits, and what robinhood
needs for its updates: even with Lustre's changelog, we still need to query
the filesystem to get additional information. I could have extended Lustre's
structures, but then I would have depended on them too much for my taste. It
just seemed cleaner to have a clear separation between the two.

> Looking at robinhood (especially v4), it seems like it could fit
> very well into the vacuum in Linux and act as "fsnotifyd".
> Unprivileged applications and services could register to event streams
> and get fed from db, so applications not running will not lose events.
> Events delivered to unprivileged applications need to be filtered by
> subtree for those applications, something that fanotify does not do and
> will not likely do, and filtered by access permissions of the application
> to the path of the reported object.

The plan is to use a dedicated message queue for the streaming part (such as
Kafka or RabbitMQ) and robinhood would only really deal with serializing events
into a standard communication format (the current target is YAML), and dumping
that into the message queues.
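
To give an idea of the serializing step, here is a rough C sketch. The event
fields (type, parent, name) and the function names are made up for
illustration and are not robinhood's actual schema or API. It also assumes
that names are valid UTF-8, which a real filesystem does not guarantee.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical event record: illustrative fields only. */
struct demo_event {
    const char *type;      /* e.g. "link" */
    const char *parent_id; /* parent fid, already hex-encoded */
    const char *name;      /* entry name inside the parent */
};

/* Write `s` to `out` as a YAML double-quoted scalar.
 * Returns the length written, or -1 if `out` is too small. */
static int yaml_quote(char *out, size_t size, const char *s)
{
    size_t n = 0;

    if (size < 3) /* two quotes and the terminating NUL */
        return -1;
    out[n++] = '"';
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        char esc[5];
        size_t len;

        if (c == '"' || c == '\\') {
            esc[0] = '\\';
            esc[1] = (char)c;
            len = 2;
        } else if (c < 0x20 || c == 0x7f) {
            /* control bytes become \xNN escapes */
            snprintf(esc, sizeof(esc), "\\x%02X", (unsigned int)c);
            len = 4;
        } else {
            esc[0] = (char)c; /* assumed to be part of valid UTF-8 */
            len = 1;
        }
        if (n + len + 2 > size) /* keep room for closing quote + NUL */
            return -1;
        memcpy(out + n, esc, len);
        n += len;
    }
    out[n++] = '"';
    out[n] = '\0';
    return (int)n;
}

/* Serialize one event as a single YAML document.
 * Returns the number of bytes written, or -1 if a buffer is too small. */
static int event_to_yaml(char *out, size_t size, const struct demo_event *ev)
{
    char type[64], parent[512], name[1024];
    int n;

    if (yaml_quote(type, sizeof(type), ev->type) < 0 ||
        yaml_quote(parent, sizeof(parent), ev->parent_id) < 0 ||
        yaml_quote(name, sizeof(name), ev->name) < 0)
        return -1;
    n = snprintf(out, size, "---\ntype: %s\nparent: %s\nname: %s\n",
                 type, parent, name);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

Every scalar is quoted so that a hex id or an odd file name is never
reinterpreted as a number or as YAML syntax by the consumer.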

From there, it's definitely possible to write a program that will filter
events and route them to unprivileged applications... But it is unlikely I will
write it myself. =)

Cheers,
Quentin

* Re: robinhood, fanotify name info events and lustre changelog
  2020-06-01 19:46               ` Quentin.BOUGET
@ 2020-06-01 20:20                 ` Amir Goldstein
  2020-06-02  1:30                   ` Quentin.BOUGET
  0 siblings, 1 reply; 10+ messages in thread
From: Amir Goldstein @ 2020-06-01 20:20 UTC (permalink / raw)
  To: Quentin.BOUGET
  Cc: Dominique Martinet, Jan Kara, linux-fsdevel, linux-api, robinhood-devel

> > > > With fanotify FAN_CREATE event, for example, the parent fid + name
> > > > information should be used by the rbh adapter code to call
> > > > name_to_handle_at(2) and get the created object's file handle.
> > > >
> > > > The reason we made this API choice is because fanotify events should
> > > > not be perceived as a sequence of changes that can be followed to
> > > > describe the current state of the filesystem.
> > > > fanotify events should be perceived as a "poll" on the namespace.
> > > > Whenever notified of a change, application should read the current state
> > > > for the filesystem. fanotify events provide "just enough" information, so
> > > > reading the current state of the filesystem is not too expensive.
>
> I am a little worried about objects that would move around constantly and thus
> "evade" name_to_handle_at(). A bad actor could try to hide a setuid binary this
> way... Of course they could also just copy/delete the file repeatedly and in
> this case having the fid becomes useless, but it seems harder to do, and it is
> likely it would take more time than a simple rename.
>

I am not following. This threat model sounds bogus, but I am not a security
expert, and fanotify async events shouldn't have anything to do with security.

If you can write a concrete use case and explain how your application
wants to handle it and why it cannot without the missing object fid information
I can give a serious answer.

> > > > When fanotify event FAN_MODIFY reports a change of file size,
> > > > along with parent fid + name, that do not match the parent/name robinhood
> > > > knows about (i.e. because the event is received out of order with rename),
> > > > you may use that information to create rbh_fsevent_ns_xattr event to
> > > > update the path or you may wait for the FAN_MOVE_SELF event that
> > > > should arrive later.
> > > > Up to you.
>
> This is making me think: if I receive such a FAN_MODIFY event, and an object
> is moved at parent_fid + name before I query the FS, how can I know which file
> the event was originally meant for?
>

FAN_MODIFY/FAN_ACCESS/FAN_ATTRIB events do have the object_fid in
addition to parent_fid + name.
FAN_CREATE/FAN_DELETE/FAN_MOVE do NOT have the object_fid,
FAN_DELETE_SELF/FAN_MOVE_SELF do have the object_fid
FAN_DELETE_SELF does NOT have parent_fid + name
FAN_MOVE_SELF does have parent_fid + name (new parent/name)
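
As a rough illustration (not actual fanotify or robinhood code), the list
above can be written down as a small C table. The enum, struct and function
names are made up for this sketch, and the table only records what is
described in this mail, which may still change before the API is final. For
events that carry both the object fid and parent_fid + name, comparing the
reported fid with whatever is currently found at parent_fid + name tells the
adapter whether the entry was replaced in the meantime.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical event kinds, named after the fanotify masks above.
 * EV_MOVE stands for FAN_MOVE (FAN_MOVED_FROM | FAN_MOVED_TO). */
enum ev_kind {
    EV_ACCESS, EV_MODIFY, EV_ATTRIB,
    EV_CREATE, EV_DELETE, EV_MOVE,
    EV_DELETE_SELF, EV_MOVE_SELF,
    EV_KIND_COUNT
};

struct ev_info {
    bool object_fid;  /* event identifies the object itself */
    bool parent_name; /* event carries parent_fid + name */
};

static const struct ev_info ev_table[EV_KIND_COUNT] = {
    [EV_ACCESS]      = { .object_fid = true,  .parent_name = true  },
    [EV_MODIFY]      = { .object_fid = true,  .parent_name = true  },
    [EV_ATTRIB]      = { .object_fid = true,  .parent_name = true  },
    [EV_CREATE]      = { .object_fid = false, .parent_name = true  },
    [EV_DELETE]      = { .object_fid = false, .parent_name = true  },
    [EV_MOVE]        = { .object_fid = false, .parent_name = true  },
    [EV_DELETE_SELF] = { .object_fid = true,  .parent_name = false },
    [EV_MOVE_SELF]   = { .object_fid = true,  .parent_name = true  },
};

/* Hypothetical fid: opaque bytes plus a type, similar in spirit to
 * struct file_handle. 128 is an arbitrary upper bound for this sketch. */
struct fid {
    int type;
    size_t len;
    unsigned char bytes[128];
};

/* True when the adapter has to resolve parent_fid + name itself, for
 * example with name_to_handle_at(2), to learn which object is meant. */
static bool needs_lookup(enum ev_kind kind)
{
    assert((unsigned int)kind < EV_KIND_COUNT);
    return ev_table[kind].parent_name && !ev_table[kind].object_fid;
}

/* Compare the fid reported by an event with the fid found by a lookup. */
static bool fid_equal(const struct fid *a, const struct fid *b)
{
    return a->type == b->type && a->len == b->len &&
           a->len <= sizeof(a->bytes) &&
           memcmp(a->bytes, b->bytes, a->len) == 0;
}
```

With this, a FAN_MODIFY handler would look up parent_fid + name, call
fid_equal() against the reported object fid, and treat a mismatch as "the
name now points at something else" rather than updating the wrong object.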

Is there anything missing in your opinion for robinhood to be able to
perform any of its missions?

Thanks,
Amir.

* Re: robinhood, fanotify name info events and lustre changelog
  2020-06-01 20:20                 ` Amir Goldstein
@ 2020-06-02  1:30                   ` Quentin.BOUGET
  0 siblings, 0 replies; 10+ messages in thread
From: Quentin.BOUGET @ 2020-06-02  1:30 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dominique Martinet, Jan Kara, linux-fsdevel, linux-api, robinhood-devel

> > > > > With fanotify FAN_CREATE event, for example, the parent fid + name
> > > > > information should be used by the rbh adapter code to call
> > > > > name_to_handle_at(2) and get the created object's file handle.
> > > > >
> > > > > The reason we made this API choice is because fanotify events should
> > > > > not be perceived as a sequence of changes that can be followed to
> > > > > describe the current state of the filesystem.
> > > > > fanotify events should be perceived as a "poll" on the namespace.
> > > > > Whenever notified of a change, application should read the current state
> > > > > for the filesystem. fanotify events provide "just enough" information, so
> > > > > reading the current state of the filesystem is not too expensive.
> >
> > I am a little worried about objects that would move around constantly and thus
> > "evade" name_to_handle_at(). A bad actor could try to hide a setuid binary this
> > way... Of course they could also just copy/delete the file repeatedly and in
> > this case having the fid becomes useless, but it seems harder to do, and it is
> > likely it would take more time than a simple rename.
> >
> 
> I am not following. This threat model sounds bogus, but I am not a security
> expert, and fanotify async events shouldn't have anything to do with security.
> 
> If you can write a concrete use case and explain how your application
> wants to handle it and why it cannot without the missing object fid information
> I can give a serious answer.

A few weeks ago, attacks on supercomputers were reported:
https://www.bbc.com/news/technology-52709660.

I am not privy to the mitigations/detection mechanisms put in place, but it is
my understanding that one thing people have been looking for is setuid/setgid
binaries. If robinhood can be trusted to "see" (and stat) every file
created/modified on a filesystem, then it can be used for a rapid
filesystem-wide scan.

EDIT: That's my bad, I should have tried fanotify first. Now that I have, I can
see that FAN_CREATE is not the only event that is emitted when a file is created
and so, even if robinhood does not see the "right" file at parent_fid + name, it
will still see the created file's fid later on as it receives the associated
FAN_OPEN event.

Sorry.

> > > > > When fanotify event FAN_MODIFY reports a change of file size,
> > > > > along with parent fid + name, that do not match the parent/name robinhood
> > > > > knows about (i.e. because the event is received out of order with rename),
> > > > > you may use that information to create rbh_fsevent_ns_xattr event to
> > > > > update the path or you may wait for the FAN_MOVE_SELF event that
> > > > > should arrive later.
> > > > > Up to you.
> >
> > This is making me think: if I receive such a FAN_MODIFY event, and an object
> > is moved at parent_fid + name before I query the FS, how can I know which file
> > the event was originally meant for?
> >
> 
> FAN_MODIFY/FAN_ACCESS/FAN_ATTRIB events do have the object_fid in
> addition to parent_fid + name.
> FAN_CREATE/FAN_DELETE/FAN_MOVE do NOT have the object_fid,
> FAN_DELETE_SELF/FAN_MOVE_SELF do have the object_fid
> FAN_DELETE_SELF does NOT have parent_fid + name
> FAN_MOVE_SELF does have parent_fid + name (new parent/name)
>
> Is there anything missing in your opinion for robinhood to be able to
> perform any of its missions?

No, I don't think so anymore.

Thanks,
Quentin

end of thread, other threads:[~2020-06-02  1:31 UTC | newest]

Thread overview: 10+ messages
2020-05-27 17:21 [GIT PULL] Fanotify revert for 5.7-rc8 Jan Kara
2020-05-27 18:10 ` pr-tracker-bot
     [not found] ` <20200527173937.GA17769@nautica>
     [not found]   ` <CAOQ4uxjQXwTo1Ug4jY1X+eBdLj80rEfJ0X3zKRi+L8L_uYSrgQ@mail.gmail.com>
2020-05-28 12:56     ` robinhood, fanotify name info events and lustre changelog Dominique Martinet
2020-05-29 18:41       ` Quentin.BOUGET
2020-05-30 13:07         ` Amir Goldstein
2020-05-30 13:39           ` Dominique Martinet
2020-05-30 20:37             ` Amir Goldstein
2020-06-01 19:46               ` Quentin.BOUGET
2020-06-01 20:20                 ` Amir Goldstein
2020-06-02  1:30                   ` Quentin.BOUGET
