* thoughts about fanotify and HSM
@ 2022-09-11 18:12 Amir Goldstein
  2022-09-12 12:57 ` Jan Kara
  2022-09-21 23:27 ` Dave Chinner
  0 siblings, 2 replies; 43+ messages in thread
From: Amir Goldstein @ 2022-09-11 18:12 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

Hi Jan,

I wanted to consult with you about preliminary design thoughts
for implementing a hierarchical storage manager (HSM)
with fanotify.

I have been in contact with some developers in the past
who were interested in using fanotify to implement HSM
(to replace old DMAPI implementation).

Basically, FAN_OPEN_PERM + FAN_MARK_FILESYSTEM
should be enough to implement a basic HSM, but it is not
sufficient for implementing more advanced HSM features.
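
For reference, the basic scheme amounts to a blocking event loop like the
rough (untested) sketch below; restore_from_slow_tier() is just a placeholder
for whatever brings the data back from the slow tier:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/fanotify.h>

/* Hypothetical helper: fetch the file's content back from the slow tier. */
static void restore_from_slow_tier(int fd) { (void)fd; }

int main(int argc, char *argv[])
{
        struct fanotify_event_metadata buf[64], *ev;
        struct fanotify_response resp = { .response = FAN_ALLOW };
        ssize_t len;
        int fan;

        if (argc != 2)
                return 1;

        fan = fanotify_init(FAN_CLASS_PRE_CONTENT | FAN_CLOEXEC, O_RDONLY);
        if (fan < 0 || fanotify_mark(fan, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
                                     FAN_OPEN_PERM, AT_FDCWD, argv[1])) {
                perror("fanotify");
                return 1;
        }

        for (;;) {
                len = read(fan, buf, sizeof(buf));
                for (ev = buf; FAN_EVENT_OK(ev, len); ev = FAN_EVENT_NEXT(ev, len)) {
                        /* Bring the data back before letting the open proceed */
                        restore_from_slow_tier(ev->fd);
                        resp.fd = ev->fd;
                        write(fan, &resp, sizeof(resp));
                        close(ev->fd);
                }
        }
}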

Some of the HSM features that I would like are:
- blocking hook before access to file range and fill that range
- blocking hook before lookup of child and optionally create child

My thoughts on the UAPI were:
- Allow a new combination of FAN_CLASS_PRE_CONTENT
  and FAN_REPORT_FID/DFID_NAME
- This combination does not allow any of the existing events
  in the mask
- It allows only new events such as FAN_PRE_ACCESS,
  FAN_PRE_MODIFY and FAN_PRE_LOOKUP
- FAN_PRE_ACCESS and FAN_PRE_MODIFY can have
  optional file range info
- All the FAN_PRE_ events are called outside of vfs locks, and
  specifically before the sb_writers lock, as in my fsnotify_pre_modify [1]
  POC

That last part is important because the HSM daemon will
need to make modifications to the accessed file/directory
before allowing the operation to proceed.
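
To make the proposal a bit more concrete, a daemon using it might set itself
up roughly as below. Note that FAN_PRE_ACCESS/FAN_PRE_MODIFY/FAN_PRE_LOOKUP
do not exist yet, so the values are made-up placeholders, and current kernels
reject FAN_CLASS_PRE_CONTENT combined with FAN_REPORT_DFID_NAME - that is
exactly the combination this proposal would allow:

#include <fcntl.h>
#include <sys/fanotify.h>

/* Proposed events - NOT in the UAPI; names from this proposal, values are
 * made-up placeholders for illustration only. */
#define FAN_PRE_ACCESS  0x00100000
#define FAN_PRE_MODIFY  0x00200000
#define FAN_PRE_LOOKUP  0x00400000

static int hsm_setup(const char *mnt)
{
        /* Proposed: FAN_CLASS_PRE_CONTENT together with FAN_REPORT_DFID_NAME;
         * current kernels reject this combination. */
        int fan = fanotify_init(FAN_CLASS_PRE_CONTENT | FAN_REPORT_DFID_NAME,
                                O_RDONLY);

        if (fan < 0)
                return -1;
        /* Only the new pre-content events would be allowed in the mask */
        if (fanotify_mark(fan, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
                          FAN_PRE_ACCESS | FAN_PRE_MODIFY | FAN_PRE_LOOKUP,
                          AT_FDCWD, mnt))
                return -1;
        return fan;
}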

Naturally, letting the HSM daemon modify files while the triggering
task is blocked opens the possibility for new userspace deadlocks.
Nothing that is not already possible with permission events, but
perhaps deadlocks that are easier to trip over.

I am not sure if we need to do anything about this, but if we want to,
we could make it easier to ignore events generated by the HSM daemon
itself, to simplify the userspace implementation.
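
For example, with the existing event format the daemon could simply
auto-allow events that it triggered itself by comparing the reported pid,
along the lines of the sketch below (with FAN_REPORT_TID the field holds a
tid instead, so a multi-threaded daemon needs a bit more care):

#include <unistd.h>
#include <sys/fanotify.h>

/* Auto-allow an event that was triggered by the HSM daemon itself.
 * Returns 1 if the event was handled (caller should close ev->fd and
 * skip it), 0 otherwise. */
static int allow_self_event(int fan, const struct fanotify_event_metadata *ev)
{
        struct fanotify_response resp = { .fd = ev->fd, .response = FAN_ALLOW };

        if (ev->pid != getpid())
                return 0;
        write(fan, &resp, sizeof(resp));
        return 1;
}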

Another thing that might be good to do is provide an administrative
interface to iterate and abort pending fanotify permission/pre-content
events.

You must have noticed the overlap between my old persistent
change tracking journal and this design. The referenced branch
is from that old POC.

I do believe that the use cases somewhat overlap and that the
same building blocks could be used to implement a persistent
change journal in userspace as you suggested back then.

Thoughts?

Amir.

[1] https://github.com/amir73il/linux/commits/fsnotify_pre_modify


* Re: thoughts about fanotify and HSM
  2022-09-11 18:12 thoughts about fanotify and HSM Amir Goldstein
@ 2022-09-12 12:57 ` Jan Kara
  2022-09-12 16:38   ` Amir Goldstein
  2022-09-21 23:27 ` Dave Chinner
  1 sibling, 1 reply; 43+ messages in thread
From: Jan Kara @ 2022-09-12 12:57 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel

Hi Amir!

On Sun 11-09-22 21:12:06, Amir Goldstein wrote:
> I wanted to consult with you about preliminary design thoughts
> for implementing a hierarchical storage manager (HSM)
> with fanotify.
> 
> I have been in contact with some developers in the past
> who were interested in using fanotify to implement HSM
> (to replace old DMAPI implementation).

Ah, DMAPI. Shiver. Bad memories of carrying that hacky code in SUSE kernels
;)

So how serious are these guys about HSM and investing in it? Because the
kernel is going to be only a small part of what's needed for it to be
useful, and we've dropped DMAPI from SUSE kernels because the code was
painful to carry and forward-port (and it was not of great quality) and the
demand for it was not really big... So I'd prefer to avoid a major API
extension unless there are serious users out there - perhaps we will even
need to develop the kernel API in cooperation with the userspace part to
verify the result is actually usable and useful. But for now we can take it
as an interesting mental exercise ;)

> Basically, FAN_OPEN_PERM + FAN_MARK_FILESYSTEM
> should be enough to implement a basic HSM, but it is not
> sufficient for implementing more advanced HSM features.
> 
> Some of the HSM feature that I would like are:
> - blocking hook before access to file range and fill that range
> - blocking hook before lookup of child and optionally create child
> 
> My thoughts on the UAPI were:
> - Allow new combination of FAN_CLASS_PRE_CONTENT
>   and FAN_REPORT_FID/DFID_NAME
> - This combination does not allow any of the existing events
>   in mask
> - It Allows only new events such as FAN_PRE_ACCESS
>   FAN_PRE_MODIFY and FAN_PRE_LOOKUP
> - FAN_PRE_ACCESS and FAN_PRE_MODIFY can have
>   optional file range info
> - All the FAN_PRE_ events are called outside vfs locks and
>   specifically before sb_writers lock as in my fsnotify_pre_modify [1]
>   POC
> 
> That last part is important because the HSM daemon will
> need to make modifications to the accessed file/directory
> before allowing the operation to proceed.

My main worry here would be that with FAN_FILESYSTEM marks, there will be
far too many events (especially for the lookup & access cases) to reasonably
process. And since the events will be blocking, the impact on performance
will be large.

I think that a reasonably efficient HSM will have to stay in the kernel
(without generating work for userspace) for the "nothing to do" case. And
only when something needs to be migrated is an event generated and
userspace gets involved. But it isn't obvious to me how to do this with
fanotify (I could imagine it with say overlayfs, which is kind of an HSM
solution already ;)).

> Naturally that opens the possibility for new userspace
> deadlocks. Nothing that is not already possible with permission
> event, but maybe deadlocks that are more inviting to trip over.
> 
> I am not sure if we need to do anything about this, but we
> could make it easier to ignore events from the HSM daemon
> itself if we want to, to make the userspace implementation easier.

So if the events happen only in the "migration needed" case, I don't think
deadlocks would be too problematic - it just requires a bit of care from
userspace so that the event processing & migration processes do not access
HSM managed stuff.

> Another thing that might be good to do is provide an administrative
> interface to iterate and abort pending fanotify permission/pre-content
> events.

You can always kill the listener. Or are you worried about cases where it
sleeps in an uninterruptible state?

> You must have noticed the overlap between my old persistent
> change tracking journal and this design. The referenced branch
> is from that old POC.
> 
> I do believe that the use cases somewhat overlap and that the
> same building blocks could be used to implement a persistent
> change journal in userspace as you suggested back then.
> 
> Thoughts?

Yes, there is some overlap. But OTOH HSM seems to require more detailed and
generally more frequent events, which seems like a challenge.

> [1] https://github.com/amir73il/linux/commits/fsnotify_pre_modify

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: thoughts about fanotify and HSM
  2022-09-12 12:57 ` Jan Kara
@ 2022-09-12 16:38   ` Amir Goldstein
       [not found]     ` <BY5PR07MB652953061D3A2243F66F0798A3449@BY5PR07MB6529.namprd07.prod.outlook.com>
                       ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Amir Goldstein @ 2022-09-12 16:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Miklos Szeredi

On Mon, Sep 12, 2022 at 3:57 PM Jan Kara <jack@suse.cz> wrote:
>
> Hi Amir!
>
> On Sun 11-09-22 21:12:06, Amir Goldstein wrote:
> > I wanted to consult with you about preliminary design thoughts
> > for implementing a hierarchical storage manager (HSM)
> > with fanotify.
> >
> > I have been in contact with some developers in the past
> > who were interested in using fanotify to implement HSM
> > (to replace old DMAPI implementation).
>
> Ah, DMAPI. Shiver. Bad memories of carrying that hacky code in SUSE kernels
> ;)
>
> So how serious are these guys about HSM and investing into it? Because

Let's put it this way.
They had to find a replacement for DMAPI so that they could stop
carrying DMAPI patches, so pretty serious.
They had to do it one way or the other.

They approached me around the time that FAN_MARK_FILESYSTEM
was merged, so I explained to them how to implement HSM using
FAN_MARK_FILESYSTEM + FAN_OPEN_PERM.
Whether they ended up using it or not - I don't know.

But I do know for a fact that there are several companies out there
implementing HSM to tier local storage to the cloud, and CTERA is one of
those companies.

We use FUSE to implement HSM and I have reason to believe that
other companies do that as well.

> kernel is going to be only a small part of what's needed for it to be
> useful and we've dropped DMAPI from SUSE kernels because the code was
> painful to carry (and forwardport + it was not of great quality) and the
> demand for it was not really big... So I'd prefer to avoid the major API
> extension unless there are serious users out there - perhaps we will even
> need to develop the kernel API in cooperation with the userspace part to
> verify the result is actually usable and useful. But for now we can take it
> as an interesting mental excercise ;)
>

Certainly. Let's make this a "Call for Users".

> > Basically, FAN_OPEN_PERM + FAN_MARK_FILESYSTEM
> > should be enough to implement a basic HSM, but it is not
> > sufficient for implementing more advanced HSM features.
> >
> > Some of the HSM feature that I would like are:
> > - blocking hook before access to file range and fill that range
> > - blocking hook before lookup of child and optionally create child
> >
> > My thoughts on the UAPI were:
> > - Allow new combination of FAN_CLASS_PRE_CONTENT
> >   and FAN_REPORT_FID/DFID_NAME
> > - This combination does not allow any of the existing events
> >   in mask
> > - It Allows only new events such as FAN_PRE_ACCESS
> >   FAN_PRE_MODIFY and FAN_PRE_LOOKUP
> > - FAN_PRE_ACCESS and FAN_PRE_MODIFY can have
> >   optional file range info
> > - All the FAN_PRE_ events are called outside vfs locks and
> >   specifically before sb_writers lock as in my fsnotify_pre_modify [1]
> >   POC
> >
> > That last part is important because the HSM daemon will
> > need to make modifications to the accessed file/directory
> > before allowing the operation to proceed.
>
> My main worry here would be that with FAN_FILESYSTEM marks, there will be
> far to many events (especially for the lookup & access cases) to reasonably
> process. And since the events will be blocking, the impact on performance
> will be large.
>

Right. That problem needs to be addressed.

> I think that a reasonably efficient HSM will have to stay in the kernel
> (without generating work for userspace) for the "nothing to do" case. And
> only in case something needs to be migrated, event is generated and
> userspace gets involved. But it isn't obvious to me how to do this with
> fanotify (I could imagine it with say overlayfs which is kind of HSM
> solution already ;)).
>

Yeah, overlayfs was on my radar for this as well ;)
as was eBPF-FUSE:
https://lore.kernel.org/linux-fsdevel/CAOQ4uxh2OZ_AMp6XRcMy0ZtjkQnBfBZFhH0t-+Pd298uPuSEVw@mail.gmail.com/

There is an existing HSM solution out there that this proposal is modeled after:
https://docs.microsoft.com/en-us/windows/win32/projfs/projected-file-system

The Windows ProjFS hooks into userspace for specific files that were
marked in some way.

There was an attempt to implement this interface for Linux, but that
project has been discontinued:
https://github.com/github/libprojfs/

To do the same thing with fanotify we would either need to define
a new inode flag (i.e. a chattr flag) for this purpose or (much better)
implement an eBPF filter for FAN_MARK_FILESYSTEM, so that
every HSM can choose whichever inode flag it wants.

> > Naturally that opens the possibility for new userspace
> > deadlocks. Nothing that is not already possible with permission
> > event, but maybe deadlocks that are more inviting to trip over.
> >
> > I am not sure if we need to do anything about this, but we
> > could make it easier to ignore events from the HSM daemon
> > itself if we want to, to make the userspace implementation easier.
>
> So if the events happen only in the "migration needed" case, I don't think
> deadlocks would be too problematic - it just requires a bit of care from
> userspace so that the event processing & migration processes do not access
> HSM managed stuff.
>

Right.

> > Another thing that might be good to do is provide an administrative
> > interface to iterate and abort pending fanotify permission/pre-content
> > events.
>
> You can always kill the listener. Or are you worried about cases where it
> sleeps in UN state?
>

FUSE, which is technically also a userspace hook for filesystem operations,
can end up in a state where the daemon cannot be killed to release a
deadlock, which is why the administrative abort interface is needed:
https://www.kernel.org/doc/html/latest/filesystems/fuse.html#aborting-a-filesystem-connection

I am not sure if we need to worry about this if the fanotify hooks are
called outside of vfs locks (?).

> > You must have noticed the overlap between my old persistent
> > change tracking journal and this design. The referenced branch
> > is from that old POC.
> >
> > I do believe that the use cases somewhat overlap and that the
> > same building blocks could be used to implement a persistent
> > change journal in userspace as you suggested back then.
> >
> > Thoughts?
>
> Yes, there is some overlap. But OTOH HSM seems to require more detailed and
> generally more frequent events which seems like a challenge.
>

I should have mentioned this in my proposal, but you are right -
kernel filtering is an essential part of implementing HSM using fanotify.

The "swap in" should actually be an uncommon case and
after it happens on a certain file that file becomes "warm"
and then "swap out" of that file would be discouraged.

Thanks,
Amir.


* Re: thoughts about fanotify and HSM
       [not found]     ` <BY5PR07MB652953061D3A2243F66F0798A3449@BY5PR07MB6529.namprd07.prod.outlook.com>
@ 2022-09-13  2:41       ` Amir Goldstein
  0 siblings, 0 replies; 43+ messages in thread
From: Amir Goldstein @ 2022-09-13  2:41 UTC (permalink / raw)
  To: Plaster, Robert; +Cc: Jan Kara, linux-fsdevel, Miklos Szeredi

> From: Amir Goldstein <amir73il@gmail.com>
> Date: Monday, September 12, 2022 at 9:38 AM
> To: Jan Kara <jack@suse.cz>
> Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>, Miklos Szeredi <miklos@szeredi.hu>
> Subject: Re: thoughts about fanotify and HSM
>
> On Mon, Sep 12, 2022 at 3:57 PM Jan Kara <jack@suse.cz> wrote:
> >
> > Hi Amir!
> >
> > On Sun 11-09-22 21:12:06, Amir Goldstein wrote:
> > > I wanted to consult with you about preliminary design thoughts
> > > for implementing a hierarchical storage manager (HSM)
> > > with fanotify.
> > >
> > > I have been in contact with some developers in the past
> > > who were interested in using fanotify to implement HSM
> > > (to replace old DMAPI implementation).
> >
> > Ah, DMAPI. Shiver. Bad memories of carrying that hacky code in SUSE kernels
> > ;)
> >
> > So how serious are these guys about HSM and investing into it? Because
>
> Let's put it this way.
> They had to find a replacement for DMAPI so that they could stop
> carrying DMAPI patches, so pretty serious.
> They had to do it one way or the other.
>
> They approached me around the time that FAN_MARK_FILESYSTEM
> was merged, so I explained them how to implement HSM using
> FAN_MARK_FILESYSTEM+FAN_OPEN_PERM
> Whether they ended up using it or not - I don't know.
>


On Tue, Sep 13, 2022 at 2:01 AM Plaster, Robert
<rplaster@deepspacestorage.com> wrote:
>
> Hi Amir – Dan got back to me. He said (fyi - SMS referenced below is our HSM app):
>

Hi Rob,

I will add that from what I read on your website [1], your entire product
is open source; the code is available on your web servers as RPMs and will
be put up on GitHub soon.
That's very good news, because it means I will be able to demo my proposed
fanotify interface improvements on your code base :)

[1] https://deepspacestorage.com/resources/#downloads

>
>
> “Amir talks about specific fanotify events used for an HSM.  He says FAN_MARK_FILESYSTEM+FAN_OPEN_PERM should be enough for a basic HSM.  As it is currently implemented, SMS uses FAN_MARK_ADD+FAN_OPEN_PERM to detect purged files and FAN_MARK_ADD+FAN_CLOSE_WRITE events to determine when a file has been potentially modified.  There are other FANOTIFY events that would be useful, but we're currently limited by the older Linux kernels in the RHEL releases we're supporting.
>
>
>
> If I understand what Amil is proposing, it appears to be some new FANOTIFY FAN_PRE_* events.  Some of it looks like something we would be interested in but as long as we continue to support older RHEL kernels, we're very limited to what we have to work with.”
>

My goal is to design the interfaces for use by future, more advanced
HSM clients.
The experience that you can bring to the table from customers using your
current HSM client is very important for making design choices for future
HSM clients - i.e. understanding and addressing the pain points with the
current fanotify interface.

So as long as this is "something that you would be interested in", I know
I am heading in the right direction ;-)

Thanks,
Amir.


* Re: thoughts about fanotify and HSM
  2022-09-12 16:38   ` Amir Goldstein
       [not found]     ` <BY5PR07MB652953061D3A2243F66F0798A3449@BY5PR07MB6529.namprd07.prod.outlook.com>
@ 2022-09-14  7:27     ` Amir Goldstein
  2022-09-14 10:30       ` Jan Kara
       [not found]     ` <BY5PR07MB6529795F49FB4E923AFCB062A3449@BY5PR07MB6529.namprd07.prod.outlook.com>
  2 siblings, 1 reply; 43+ messages in thread
From: Amir Goldstein @ 2022-09-14  7:27 UTC (permalink / raw)
  To: Jan Kara, Miklos Szeredi; +Cc: Plaster, Robert, David Howells, linux-fsdevel

On Mon, Sep 12, 2022 at 7:38 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, Sep 12, 2022 at 3:57 PM Jan Kara <jack@suse.cz> wrote:
> >
> > Hi Amir!
> >
> > On Sun 11-09-22 21:12:06, Amir Goldstein wrote:
> > > I wanted to consult with you about preliminary design thoughts
> > > for implementing a hierarchical storage manager (HSM)
> > > with fanotify.
> > >

I feel that the discussion is losing focus, so let me try to refocus
and list the pros and cons of the different options for an HSM API...

> > > I have been in contact with some developers in the past
> > > who were interested in using fanotify to implement HSM
> > > (to replace old DMAPI implementation).
> >
> > Ah, DMAPI. Shiver. Bad memories of carrying that hacky code in SUSE kernels
> > ;)

For the record, DMAPI is still partly supported on some proprietary
filesystems, but even if a full implementation existed, this old API,
which was mostly used for tape devices, is not a good fit for modern-day
cloud storage use cases.

> >
> > So how serious are these guys about HSM and investing into it?
>
> Let's put it this way.
> They had to find a replacement for DMAPI so that they could stop
> carrying DMAPI patches, so pretty serious.
> They had to do it one way or the other.
>

As mentioned earlier, this is an open source HSM project [1]
with a release coming soon that is using FAN_OPEN_PERM
to migrate data from the slower tier.

As you can imagine, FAN_OPEN_PERM can only get you as far as DMAPI,
but not beyond, and it leaves the problem of setting the marks on all
punched files at bringup.

>
> But I do know for a fact that there are several companies out there
> implementing HSM to tier local storage to cloud and CTERA is one of
> those companies.
>
> We use FUSE to implement HSM and I have reason to believe that
> other companies do that as well.
>

FUSE is the most flexible API to implement HSM, but it suffers
from performance overhead in the "fast" path due to context switches
and cache line bounces.

FUSE_PASSTHROUGH patches [2] address this overhead for
large file IO. I plan to upstream those patches.

The FUSE-BPF [3] and the earlier extFUSE [4] projects aim to address this
overhead for readdir and other operations.

This is an alluring option for companies that already use FUSE for HSM,
because they will not need to change their implementation much,
but my gut feeling is that there are interesting corner cases lurking...

> > kernel is going to be only a small part of what's needed for it to be
> > useful and we've dropped DMAPI from SUSE kernels because the code was
> > painful to carry (and forwardport + it was not of great quality) and the
> > demand for it was not really big...

Note that what was not big was the demand for the crappy DMAPI ;)
That says nothing about the demand for HSM solutions,
which exists and is growing IMO.

> > So I'd prefer to avoid the major API
> > extension unless there are serious users out there - perhaps we will even
> > need to develop the kernel API in cooperation with the userspace part to
> > verify the result is actually usable and useful.

Yep. It should be trivial to implement a "mirror" HSM backend.
For example, the libprojfs [5] project implements a MirrorProvider
backend for the Microsoft ProjFS [6] HSM API.

>
> > > Basically, FAN_OPEN_PERM + FAN_MARK_FILESYSTEM
> > > should be enough to implement a basic HSM, but it is not
> > > sufficient for implementing more advanced HSM features.
> > >
[...]
> > My main worry here would be that with FAN_FILESYSTEM marks, there will be
> > far to many events (especially for the lookup & access cases) to reasonably
> > process. And since the events will be blocking, the impact on performance
> > will be large.
> >
>
> Right. That problem needs to be addressed.
>
> > I think that a reasonably efficient HSM will have to stay in the kernel
> > (without generating work for userspace) for the "nothing to do" case. And
> > only in case something needs to be migrated, event is generated and
> > userspace gets involved. But it isn't obvious to me how to do this with
> > fanotify (I could imagine it with say overlayfs which is kind of HSM
> > solution already ;)).
> >

It's true, overlayfs is kind of an HSM, but:
- Without swap out to a slower tier
- Without user control over the method of swap in from the slower tier

On another thread regarding FUSE-BPF, Miklos also mentioned
the option to add those features to overlayfs [7] to make it useful
as an HSM kernel driver.

So we have at least three options for an HSM kernel driver (FUSE,
fanotify, overlayfs), but none of them is yet fully equipped to drive
a modern HSM implementation.

What is clear is that:
1. The fast path must not context switch to userspace
2. The slow path needs an API for calling into userspace to migrate files/dirs

What is not clear is:
1. The method to persistently mark files/dirs for fast/slow path
2. The API to call into userspace

Overlayfs provides a method to mark files for the slow path
('trusted.overlay.metacopy' xattr), meaning a file that has metadata
but not the data, but overlayfs does not provide an API to perform
"user controlled migration" of the data.

Instead of inventing a new API for that, I'd rather extend the
known fanotify protocol and allow the new FAN_XXX_PRE events
only on filesystems that have the concept of a file without its content
(a.k.a. metacopy).

We could say that filesystems that support fscache can also support
FAN_XXX_PRE events, and perhaps cachefilesd could make use of
hooks to implement user modules that populate the fscache objects
out of band.

There is the naive approach of interpreting a "punched hole" in a file as
"no content", as DMAPI did, to support FAN_XXX_PRE events on
standard local filesystems (fscache does that internally).
That would be an opt-in via a fanotify_init() flag and could be useful for
old DMAPI HSM implementations that are converted to use the new API.
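
As a rough sketch of that naive interpretation, a fully punched file can be
detected with SEEK_DATA, assuming the filesystem reports holes accurately
(filesystems without SEEK_DATA support report the whole file as data):

#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>

/* Naive "no content" test: there is no data extent anywhere in the file.
 * (An empty file also matches, which is fine for this heuristic.) */
static int file_is_punched(int fd)
{
        return lseek(fd, 0, SEEK_DATA) < 0 && errno == ENXIO;
}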

Practically, the filesystems that allow FAN_XXX_PRE events
on punched files would need to advertise this support and maintain
an inode flag (i.e. I_NODATA) to avoid a performance penalty
on every file access. If we take that route, though, it might be better
to let the HSM daemon set this flag explicitly (e.g. chattr +X)
when punching holes in files and remove the flag explicitly
when filling the holes.
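
For illustration, a daemon can already maintain such a persistent per-inode
mark with the generic inode flags ioctls; there is no dedicated HSM flag
today, so FS_NODUMP_FL below is only a stand-in for a hypothetical
"no data" flag:

#include <sys/ioctl.h>
#include <linux/fs.h>

/* Set/clear a persistent "content is punched out" mark on the inode.
 * FS_NODUMP_FL is only a stand-in for a real HSM flag. */
static int mark_punched(int fd, int punched)
{
        int attr;

        if (ioctl(fd, FS_IOC_GETFLAGS, &attr))
                return -1;
        if (punched)
                attr |= FS_NODUMP_FL;
        else
                attr &= ~FS_NODUMP_FL;
        return ioctl(fd, FS_IOC_SETFLAGS, &attr);
}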

And there is the most flexible option of attaching a BPF filter to
a filesystem mark, but I am afraid that this program will be limited
to using information already in the path/dentry/inode structs.
At least the HSM could use an existing arbitrary inode flag
(e.g. chattr +i) as a "persistent mark".

So many options! I don't know which to choose :)

If this plan sounds reasonable, I can start with a POC of
"user controlled copy up/down" for overlayfs, using fanotify
as the user notification protocol and see where it goes from there.

Thanks for reading my brain dump ;)

Amir.

[1] https://deepspacestorage.com/
[2] https://lore.kernel.org/linux-fsdevel/20210125153057.3623715-1-balsini@android.com/
[3] https://lpc.events/event/16/contributions/1339/attachments/945/1861/LPC2022%20Fuse-bpf.pdf
[4] https://github.com/extfuse/extfuse
[5] https://github.com/github/libprojfs
[6] https://docs.microsoft.com/en-us/windows/win32/api/_projfs/
[7] https://lore.kernel.org/linux-fsdevel/CAJfpegt4N2nmCQGmLSBB--NzuSSsO6Z0sue27biQd4aiSwvNFw@mail.gmail.com/


* Re: thoughts about fanotify and HSM
       [not found]     ` <BY5PR07MB6529795F49FB4E923AFCB062A3449@BY5PR07MB6529.namprd07.prod.outlook.com>
@ 2022-09-14  9:29       ` Jan Kara
  0 siblings, 0 replies; 43+ messages in thread
From: Jan Kara @ 2022-09-14  9:29 UTC (permalink / raw)
  To: Plaster, Robert; +Cc: Amir Goldstein, Jan Kara, linux-fsdevel, Miklos Szeredi

Hello Robert!

On Mon 12-09-22 21:10:24, Plaster, Robert wrote:
> HI Jan – Our team has been using fanotify for HSM as a DMAPI replacement
> for a while now. We came from StorageTek/Sun/Oracle HSM product line
> development teams. We have been working on this for about 5 years and
> just this month are supporting end-users. DMAPI was a huge stumbling
> block for us. We figured out what minimum set of api calls were needed to
> make it work.
> 
> Our experience with fanotify has been fantastic. Not much overhead CPU
> load but for boot volumes we do filter out events for swap and other
> (there are so many) OS temp files that are really of no concern to HSM.
> We can create as many files as the file system on NVMe can without any
> back-pressure and the HSM process will go as fast as the target media
> supports.

I'm glad to hear fanotify is useful for you.

> We have tested close to 600M files per HSM client and we keep adding
> client files as time permits, we have no coded limits for the number of
> HSM clients or max number of files in the repository. Also, the
> repository for HSM clients is heterogenous so it allows us to push files
> from one client type to another without any transcoding. I asked the guys
> doing the actual fanotify part to comment but they said it would be a
> couple days as they are heads down on a fix for a customer.
> 
> Currently we have HSM and punch-hole running on xfs and tested it on zfs
> (works but client isn’t finalized) and we have Lustre and SpectrumScale
> on our to-do list. Basically any FS with extended attributes should work
> for HSM and some (not all) will work with punch-hole capabilities.
> 
> We have developed a HSM target for certain object stores (Ceph librados
> and we have our own in-house object store) that support stream-IO and of
> course any tape technology. We have a replication tool for making an S3
> target look like the source FS but its just replication, not HSM. Until
> we get a S3 io-streaming we can’t use it for HSM. Our implementation only
> works with our open-source catalog, archive platform. We tried to
> announce this capability to the ceph community but we could never get
> past their gatekeepers so only people we actually talk to know about it.
> 
> Check out our site (kinda sucks and a little markety) but it’s a good
> primer. In it are links to the code and manuals we have done. We have not
> put out on github yet but will very soon. We are getting ready to post
> some big updates to really simplify installation and configuration and
> some bug fixes for some weird edge-cases.

Thanks for the info and the links! It is interesting to learn something about
how users are actually using our code :).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: thoughts about fanotify and HSM
  2022-09-14  7:27     ` Amir Goldstein
@ 2022-09-14 10:30       ` Jan Kara
  2022-09-14 11:52         ` Amir Goldstein
  0 siblings, 1 reply; 43+ messages in thread
From: Jan Kara @ 2022-09-14 10:30 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Miklos Szeredi, Plaster, Robert, David Howells, linux-fsdevel

On Wed 14-09-22 10:27:48, Amir Goldstein wrote:
> On Mon, Sep 12, 2022 at 7:38 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > I have been in contact with some developers in the past
> > > > who were interested in using fanotify to implement HSM
> > > > (to replace old DMAPI implementation).
> > >
> > > Ah, DMAPI. Shiver. Bad memories of carrying that hacky code in SUSE kernels
> > > ;)
> 
> For the record, DMAPI is still partly supported on some proprietary
> filesystems, but even if a full implementation existed, this old API
> which was used for tape devices mostly is not a good fit for modern
> day cloud storage use cases.

Interesting, I didn't know DMAPI still lives :)

> > > So how serious are these guys about HSM and investing into it?
> >
> > Let's put it this way.
> > They had to find a replacement for DMAPI so that they could stop
> > carrying DMAPI patches, so pretty serious.
> > They had to do it one way or the other.
> >
> 
> As mentioned earlier, this is an open source HSM project [1]
> with a release coming soon that is using FAN_OPEN_PERM
> to migrate data from the slower tier.
> 
> As you can imagine, FAN_OPEN_PERM can only get you as far
> as DMAPI but not beyond and it leaves the problem of setting the
> marks on all punched files on bringup.

Sure, I can see that FAN_OPEN_PERM works for basic usage, but it certainly
leaves room for improvement :)

<snip nice summary of FUSE options>

> > > So I'd prefer to avoid the major API
> > > extension unless there are serious users out there - perhaps we will even
> > > need to develop the kernel API in cooperation with the userspace part to
> > > verify the result is actually usable and useful.
> 
> Yap. It should be trivial to implement a "mirror" HSM backend.
> For example, the libprojfs [5] projects implements a MirrorProvider
> backend for the Microsoft ProjFS [6] HSM API.

Well, validating that things work using some simple backend is one thing,
but we are probably also interested in whether the result is practical to
use - i.e., whether the performance meets the needs, whether the API is not
cumbersome for what HSM solutions need to do, whether the more advanced
features like range-support are useful the way they are implemented, etc.
We can verify some of these things with a simple mirror HSM backend, but I'm
afraid some of the problems may become apparent only once someone actually
uses the result in practice, and for that we need a userspace counterpart
that actually does something useful so that people have motivation to use
it :).
 
> > > > Basically, FAN_OPEN_PERM + FAN_MARK_FILESYSTEM
> > > > should be enough to implement a basic HSM, but it is not
> > > > sufficient for implementing more advanced HSM features.
> > > >
> [...]
> > > My main worry here would be that with FAN_FILESYSTEM marks, there will be
> > > far to many events (especially for the lookup & access cases) to reasonably
> > > process. And since the events will be blocking, the impact on performance
> > > will be large.
> > >
> >
> > Right. That problem needs to be addressed.
> >
> > > I think that a reasonably efficient HSM will have to stay in the kernel
> > > (without generating work for userspace) for the "nothing to do" case. And
> > > only in case something needs to be migrated, event is generated and
> > > userspace gets involved. But it isn't obvious to me how to do this with
> > > fanotify (I could imagine it with say overlayfs which is kind of HSM
> > > solution already ;)).
> > >
> 
> It's true, overlayfs is kind of HSM, but:
> - Without swap out to slower tier
> - Without user control over method of swap in from slower tier
> 
> On another thread regarding FUSE-BPF, Miklos also mentioned
> the option to add those features to overlayfs [7] to make it useful
> as an HSM kernel driver.
> 
> So we have at least three options for an HSM kernel driver (FUSE,
> fanotify, overlayfs), but none of them is still fully equipped to drive
> a modern HSM implementation.
> 
> What is clear is that:
> 1. The fast path must not context switch to userspace
> 2. The slow path needs an API for calling into user to migrate files/dirs
> 
> What is not clear is:
> 1. The method to persistently mark files/dirs for fast/slow path
> 2. The API to call into userspace

Agreed.

> Overlayfs provides a method to mark files for slow path
> ('trusted.overlay.metacopy' xattr), meaning file that has metadata
> but not the data, but overlayfs does not provide the API to perform
> "user controlled migration" of the data.
> 
> Instead of inventing a new API for that, I'd rather extend the
> known fanotify protocol and allow the new FAN_XXX_PRE events
> only on filesystems that have the concept of a file without its content
> (a.k.a. metacopy).
> 
> We could say that filesystems that support fscache can also support
> FAN_XXX_PRE events, and perhaps cachefilesd could make use of
> hooks to implement user modules that populate the fscache objects
> out of band.

One possible approach is that we would make these events explicitly
targeted at HSM and generated directly by the filesystem that wants to
support HSM. So basically when the filesystem finds out it needs the data
filled in, it will call something like:

  fsnotify(inode, FAN_PRE_GIVE_ME_DATA, perhaps_some_details_here)

Something like what we currently do for filesystem error events, but in this
case the event will work like a permission event. Userspace can be watching
the filesystem with a superblock mark to receive these events. The persistent
marking of files is completely left up to the filesystem in this case - it
has to decide when the FAN_PRE_GIVE_ME_DATA event needs to be generated for
an inode.

> There is the naive approach to interpret a "punched hole" in a file as
> "no content" as DMAPI did, to support FAN_XXX_PRE events on
> standard local filesystem (fscache does that internally).
> That would be an opt-in via fanotify_init() flag and could be useful for
> old DMAPI HSM implementations that are converted to use the new API.

I'd prefer to leave these details up to the filesystem wanting to support
HSM and not clutter the fanotify API with details about file layout.

> Practically, the filesystems that allow FAN_XXX_PRE events
> on punched files would need to advertise this support and maintain
> an inode flag (i.e. I_NODATA) to avoid a performance penalty
> on every file access. If we take that route, though, it might be better
> off to let the HSM daemon set this flag explicitly (e.g. chattr +X)
> when punching holes in files and removing the flag explicitly
> when filling the holes.

Again, in what I propose this would be left up to the filesystem - e.g. it
can have an inode flag or xattr or something else to carry the information
that this file is under HSM control and call fsnotify() when the file is
accessed. It might be challenging to fulfill your desire to generate the
event outside of any filesystem locks with this design, though.

> And there is the most flexible option of attaching a BFP filter to
> a filesystem mark, but I am afraid that this program will be limited
> to using information already in the path/dentry/inode struct.
> At least HSM could use an existing arbitrary inode flag
> (e.g. chattr+i) as "persistent marks".
> 
> So many options! I don't know which to choose :)
> 
> If this plan sounds reasonable, I can start with a POC of
> "user controlled copy up/down" for overlayfs, using fanotify
> as the user notification protocol and see where it goes from there.

Yeah, that might be interesting to see as an example. Another example of
"kind-of-HSM" that we already have in the kernel is autofs. So we can think
about whether that could be implemented using the scheme we design, as an
exercise.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: thoughts about fanotify and HSM
  2022-09-14 10:30       ` Jan Kara
@ 2022-09-14 11:52         ` Amir Goldstein
  2022-09-20 18:19           ` Amir Goldstein
  0 siblings, 1 reply; 43+ messages in thread
From: Amir Goldstein @ 2022-09-14 11:52 UTC (permalink / raw)
  To: Jan Kara; +Cc: Miklos Szeredi, Plaster, Robert, David Howells, linux-fsdevel

> > > > So I'd prefer to avoid the major API
> > > > extension unless there are serious users out there - perhaps we will even
> > > > need to develop the kernel API in cooperation with the userspace part to
> > > > verify the result is actually usable and useful.
> >
> > Yap. It should be trivial to implement a "mirror" HSM backend.
> > For example, the libprojfs [5] projects implements a MirrorProvider
> > backend for the Microsoft ProjFS [6] HSM API.
>
> Well, validating that things work using some simple backend is one thing
> but we are probably also interested in whether the result is practical to
> use - i.e., whether the performance meets the needs, whether the API is not
> cumbersome for what HSM solutions need to do, whether the more advanced
> features like range-support are useful the way they are implemented etc.
> We can verify some of these things with simple mirror HSM backend but I'm
> afraid some of the problems may become apparent only once someone actually
> uses the result in practice and for that we need a userspace counterpart
> that does actually something useful so that people have motivation to use
> it :).

Makes sense.

> > Overlayfs provides a method to mark files for slow path
> > ('trusted.overlay.metacopy' xattr), meaning file that has metadata
> > but not the data, but overlayfs does not provide the API to perform
> > "user controlled migration" of the data.
> >
> > Instead of inventing a new API for that, I'd rather extend the
> > known fanotify protocol and allow the new FAN_XXX_PRE events
> > only on filesystems that have the concept of a file without its content
> > (a.k.a. metacopy).
> >
> > We could say that filesystems that support fscache can also support
> > FAN_XXX_PRE events, and perhaps cachefilesd could make use of
> > hooks to implement user modules that populate the fscache objects
> > out of band.
>
> One possible approach is that we would make these events explicitely
> targetted to HSM and generated directly by the filesystem which wants to
> support HSM. So basically when the filesystem finds out it needs the data
> filled in, it will call something like:
>
>   fsnotify(inode, FAN_PRE_GIVE_ME_DATA, perhaps_some_details_here)
>
> Something like what we currently do for filesystem error events but in this
> case the event will work like a permission event. Userspace can be watching
> the filesystem with superblock mark to receive these events. The persistent
> marking of files is completely left upto the filesystem in this case - it
> has to decide when the FAN_PRE_GIVE_ME_DATA event needs to be generated for
> an inode.
>

Whoa! That's the DMAPI pain all over again.
I do not want to go there.
I have neither the capacity nor the desire to drive this sort of change
through every single filesystem and convince the maintainers to support
those hooks forever (it seems that you have experienced this pain yourself).

I would like to be able to have a generic vfs implementation.
I do not want to rely on the filesystem at all to decide when local data
is valid or not for the HSM.

That should be completely up to the HSM implementation to decide.
My example with a punched file was just to demonstrate what an
implementation that is a DMAPI drop-in replacement would look like.

It might be OK for specific filesystems, e.g. overlayfs, to call
fsnotify(inode, FAN_PRE_XXX, ...), e.g. before copy up. That would
be done like the remote notifications we discussed for fuse/cifs.
I still need to think about how remote and local notifications are
distinguished, but at least a bit in the event mask for "remote"
should be enough for the eBPF filter to make the decision.

> > There is the naive approach to interpret a "punched hole" in a file as
> > "no content" as DMAPI did, to support FAN_XXX_PRE events on
> > standard local filesystem (fscache does that internally).
> > That would be an opt-in via fanotify_init() flag and could be useful for
> > old DMAPI HSM implementations that are converted to use the new API.
>
> I'd prefer to leave these details upto the filesystem wanting to support
> HSM and not clutter fanotify API with details about file layout.
>

Not cluttering fanotify with details about the filesystem - agreed.
Cluttering the filesystem with details about HSM - disagree.
A BPF filter on a fanotify mark with the details about the specific HSM
should resolve both our objections.

If one wants to write an HSM that works with a "chattr +X" filter
then that HSM will require filesystem support, but there are ways
to implement an HSM that would be mostly filesystem agnostic.

> > Practically, the filesystems that allow FAN_XXX_PRE events
> > on punched files would need to advertise this support and maintain
> > an inode flag (i.e. I_NODATA) to avoid a performance penalty
> > on every file access. If we take that route, though, it might be better
> > off to let the HSM daemon set this flag explicitly (e.g. chattr +X)
> > when punching holes in files and removing the flag explicitly
> > when filling the holes.
>
> Again, in what I propose this would be left upto the filesystem - e.g. it
> can have inode flag or xattr or something else to carry the information
> that this file is under HSM and call fsnotify() when the file is accessed.
> It might be challenging to fulfill your desire to generate the event
> outside of any filesystem locks with this design though.
>

Right. Another reason to dislike fs internal hooks.
I think my eBPF suggestion does not suffer from this problem.

> > And there is the most flexible option of attaching a BFP filter to
> > a filesystem mark, but I am afraid that this program will be limited
> > to using information already in the path/dentry/inode struct.
> > At least HSM could use an existing arbitrary inode flag
> > (e.g. chattr+i) as "persistent marks".
> >
> > So many options! I don't know which to choose :)
> >
> > If this plan sounds reasonable, I can start with a POC of
> > "user controlled copy up/down" for overlayfs, using fanotify
> > as the user notification protocol and see where it goes from there.
>

I am not sure anymore if overlayfs is a good place to start this POC.
I think that an easier POC would be to use an example "demo filter" in C
that checks some inode flag, until the eBPF filter is implemented, and
to write a demo backend to demonstrate the usefulness of the API extensions.

> Yeah, that might be interesting to see as an example. Another example of
> "kind-of-hsm" that we already have in the kernel is autofs. So we can think
> whether that could be implemented using the scheme we design as an
> excercise.
>

An interesting exercise, but so far fsnotify has not dealt with
namespace changes at all; it has been completely restricted to notifying
about filesystem changes, so I don't think now would be a good
time to change that paradigm.

Thanks,
Amir.


* Re: thoughts about fanotify and HSM
  2022-09-14 11:52         ` Amir Goldstein
@ 2022-09-20 18:19           ` Amir Goldstein
  2022-09-22 10:48             ` Jan Kara
  0 siblings, 1 reply; 43+ messages in thread
From: Amir Goldstein @ 2022-09-20 18:19 UTC (permalink / raw)
  To: Jan Kara
  Cc: Miklos Szeredi, Plaster, Robert, David Howells, linux-fsdevel, Fufu Fang

On Wed, Sep 14, 2022 at 2:52 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> > > > > So I'd prefer to avoid the major API
> > > > > extension unless there are serious users out there - perhaps we will even
> > > > > need to develop the kernel API in cooperation with the userspace part to
> > > > > verify the result is actually usable and useful.
> > >
> > > Yap. It should be trivial to implement a "mirror" HSM backend.
> > > For example, the libprojfs [5] projects implements a MirrorProvider
> > > backend for the Microsoft ProjFS [6] HSM API.
> >
> > Well, validating that things work using some simple backend is one thing
> > but we are probably also interested in whether the result is practical to
> > use - i.e., whether the performance meets the needs, whether the API is not
> > cumbersome for what HSM solutions need to do, whether the more advanced
> > features like range-support are useful the way they are implemented etc.
> > We can verify some of these things with simple mirror HSM backend but I'm
> > afraid some of the problems may become apparent only once someone actually
> > uses the result in practice and for that we need a userspace counterpart
> > that does actually something useful so that people have motivation to use
> > it :).
>

Hi Jan,

I wanted to give an update on the POC that I am working on.
I decided to find a FUSE HSM and show how it may be converted
to use fanotify HSM hooks.

HTTPDirFS is a read-only FUSE filesystem that lazily populates a local
cache from a remote http server on first access to every directory and
file range.

Normally, it would be run like this:
./httpdirfs --cache-location /vdf/cache https://cdn.kernel.org/pub/ /mnt/pub/

Content is accessed via the FUSE mount at /mnt/pub/ and FUSE implements
passthrough calls to the local cache dir if the cache is already populated.

After my conversion patches [1], this download-only HSM can be run like
this without mounting FUSE:

sudo ./httpdirfs --fanotify --cache-location /vdf/cache
https://cdn.kernel.org/pub/ -

[1] https://github.com/amir73il/httpdirfs/commits/fanotify_pre_content

Browsing the cache directory at /vdf/cache lazily populates the local cache
using FAN_ACCESS_PERM readdir hooks and lazily downloads file content
using FAN_ACCESS_PERM read hooks.

Up to this point, the implementation did not require any kernel changes.
However, this type of command does not populate the path components,
because lookup does not generate a FAN_ACCESS_PERM event:

stat /vdf/cache/data/linux/kernel/firmware/linux-firmware-20220815.tar.gz

To bridge that functionality gap, I've implemented the FAN_LOOKUP_PERM
event [2] and used it to lazily populate the directories in the path ancestry.
For now, I stuck with the XXX_PERM convention and did not require
FAN_CLASS_PRE_CONTENT, although we probably should.

[2] https://github.com/amir73il/linux/commits/fanotify_pre_content

Streaming reads of large files work as well, but only for sequential read
patterns. Unlike the FUSE read calls, the FAN_ACCESS_PERM events
do not (yet) carry range info, so my naive implementation downloads
one extra data chunk on each FAN_ACCESS_PERM until the cache file is full.

This makes it possible to run commands like:

tar tvfz /vdf/cache/data/linux/kernel/firmware/linux-firmware-20220815.tar.gz
| less

without having to wait for the entire 400MB file to download before
seeing the first page.

This streaming feature is extremely important for modern HSMs
that are often used to archive large media files in the cloud.
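
For reference, the chunk-on-access logic amounts to something like the
sketch below; cache_size_of(), download_next_chunk() and CHUNK_SIZE are
stand-ins for the HTTPDirFS cache helpers:

#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/fanotify.h>

/* Hypothetical cache helpers standing in for the HTTPDirFS code */
extern off_t cache_size_of(int fd);
extern void download_next_chunk(int fd, size_t len);
#define CHUNK_SIZE (8 * 1024 * 1024)

/* On every blocking read event, fetch one more chunk if the cache file
 * is not full yet, then let the reader proceed; a sequential reader
 * thus streams the file chunk by chunk. */
static void handle_access_perm(int fan, struct fanotify_event_metadata *ev)
{
        struct fanotify_response resp = { .fd = ev->fd, .response = FAN_ALLOW };
        struct stat st;

        if (!fstat(ev->fd, &st) && cache_size_of(ev->fd) < st.st_size)
                download_next_chunk(ev->fd, CHUNK_SIZE);

        write(fan, &resp, sizeof(resp));
        close(ev->fd);
}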

For the next steps of the POC, I could do:
- Report FAN_ACCESS_PERM range info to implement random read
  patterns (e.g. unzip -l)
- Introduce FAN_MODIFY_PERM, so file content could be downloaded
  before modifying a read-write HSM cache
- Demo conversion of a read-write FUSE HSM implementation
  (e.g. https://github.com/volga629/davfs2)
- Demo HSM with filesystem mark [*] and a hardcoded test filter

[*] Note that unlike the case with recursive inotify, this POC HSM
implementation is not racy, because of the lookup permission events.
A filesystem mark is still needed to avoid pinning all the unpopulated
cache tree leaf entries to the inode cache, so that this HSM can work on
a very large tree, the same as my original use case for implementing
the filesystem mark.

If what you are looking for is an explanation of why a fanotify HSM would
be better than a FUSE HSM implementation, then there are several reasons.
Performance is at the top of the list. There is this famous USENIX paper [3]
about FUSE passthrough performance.
It is a bit outdated, but many parts are still relevant - you can ask the
Android developers why they decided to work on FUSE-BPF...

[3] https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf

For me, performance is one of the main concerns, but not the only one,
so I am not entirely convinced that a full FUSE-BPF implementation would
solve all my problems.

When scaling to many millions of passthrough inodes, resource usage starts
becoming a limitation of a FUSE passthrough implementation, and memory
reclaim of a native fs works a lot better than memory reclaim over FUSE over
another native fs.

When the workload works on the native filesystem, it is also possible to
use native fs features (e.g. XFS ioctls).

Questions:
- What do you think about the direction this POC has taken so far?
- Is there anything specific that you would like to see in the POC
  to be convinced that this API will be useful?

Thanks,
Amir.


* Re: thoughts about fanotify and HSM
  2022-09-11 18:12 thoughts about fanotify and HSM Amir Goldstein
  2022-09-12 12:57 ` Jan Kara
@ 2022-09-21 23:27 ` Dave Chinner
  2022-09-22  4:35   ` Amir Goldstein
  1 sibling, 1 reply; 43+ messages in thread
From: Dave Chinner @ 2022-09-21 23:27 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel

On Sun, Sep 11, 2022 at 09:12:06PM +0300, Amir Goldstein wrote:
> Hi Jan,
> 
> I wanted to consult with you about preliminary design thoughts
> for implementing a hierarchical storage manager (HSM)
> with fanotify.
> 
> I have been in contact with some developers in the past
> who were interested in using fanotify to implement HSM
> (to replace old DMAPI implementation).
> 
> Basically, FAN_OPEN_PERM + FAN_MARK_FILESYSTEM
> should be enough to implement a basic HSM, but it is not
> sufficient for implementing more advanced HSM features.

Ah, I wondered where the people with that DMAPI application went all
those years ago after I told them they should look into using
fanotify to replace the dependency they had on the DMAPI patched
XFS that SGI maintained for years for SLES kernels...

> Some of the HSM feature that I would like are:
> - blocking hook before access to file range and fill that range
> - blocking hook before lookup of child and optionally create child

Ok, so these are to replace the DMAPI hooks that provided a blocking
userspace upcall to the HSM to allow it to fetch data from offline
tiers that isn't currently in the filesystem itself. I.e. the inode
currently has a hole over that range of data, but before the read
can proceed the HSM needs to retrieve the data from the remote
storage and write it into the local filesystem.

I think that you've missed a bunch of blocking notifications that
are needed, though. e.g. truncate needs to block while the HSM
records that the file ranges it is storing offline are now freed.
fallocate() needs to block while it waits for the HSM to tell it the
ranges of the file that actually contain data and so need to
be taken into account (e.g. ZERO_RANGE needs to wait for offline
data to be invalidated, COLLAPSE_RANGE needs offline data to be
recalled just like a read() operation, etc).

IOWs, any operation that manipulates the extent map or the data in
the file needs a blocking upcall to the HSM so that it can restore
and invalidate the offline data across the range of the operation
that is about to be performed....

> My thoughts on the UAPI were:
> - Allow new combination of FAN_CLASS_PRE_CONTENT
>   and FAN_REPORT_FID/DFID_NAME
> - This combination does not allow any of the existing events
>   in mask
> - It Allows only new events such as FAN_PRE_ACCESS
>   FAN_PRE_MODIFY and FAN_PRE_LOOKUP
> - FAN_PRE_ACCESS and FAN_PRE_MODIFY can have
>   optional file range info
> - All the FAN_PRE_ events are called outside vfs locks and
>   specifically before sb_writers lock as in my fsnotify_pre_modify [1]
>   POC
> 
> That last part is important because the HSM daemon will
> need to make modifications to the accessed file/directory
> before allowing the operation to proceed.

Yes, and that was the biggest problem with DMAPI - the locking
involved. DMAPI operations have to block without holding any locks
that the IO path, truncate, fallocate, etc might need, but once they
are unblocked they need to regain those locks to allow the operation
to proceed. This was by far the ugliest part of the DMAPI patches,
and ultimately, the reason why it was never merged.

> Naturally that opens the possibility for new userspace
> deadlocks. Nothing that is not already possible with permission
> event, but maybe deadlocks that are more inviting to trip over.
> 
> I am not sure if we need to do anything about this, but we
> could make it easier to ignore events from the HSM daemon
> itself if we want to, to make the userspace implementation easier.

XFS used "invisible IO" as the mechanism for avoiding sending DMAPI
events for operations that we initiated by the HSM to move data into
and out of the filesystem.

No doubt you've heard us talk about invisible IO in the past -
O_NOCMTIME is what that invisible IO has eventually turned into in a
modern Linux kernel. We still use that for invisible IO - xfs_fsr
uses it for moving data around during online defragmentation. The
entire purpose of invisible IO was to provide a path for HSMs and
userspace bulk data movers (e.g. HSM aware backup tools like
xfsdump) to do IO without generating unnecessary or recursive DMAPI
events....

IOWs, if we want a DMAPI replacement, we will need to formalise a
method of performing syscall based operations that will not trigger
HSM notification events.

The other thing that XFS had for DMAPI was persistent storage in the
inode of the event mask that inode should report events for. See
the di_dmevmask and di_dmstate fields defined in the on-disk inode
format here:

https://git.kernel.org/pub/scm/fs/xfs/xfs-documentation.git/tree/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc

There's no detail for them, but the event mask indicated what DMAPI
events the inode should issue notifications for, and the state field
held information about DMAPI operations in progress.

The event field is the important one here - if the event field was
empty, access to the inode never generated DMAPI events. When the
HSM moved data offline, the "READ" event mask bit was set by the HSM
and that triggered DMAPI events for any operation that needed to
read data or manipulate the extent map. When the data was brought
entirely back online, the event mask could be cleared.

However, DMAPI also supports dual state operation, where the
data in the local filesystem is also duplicated in the offline
storage (e.g. immediately after a recall operation). This state can
persist until data or layout is changed in the local filesystem,
and so there's a "WRITE" event mask as well that allows the
filesystem to inform the HSM that data it may have in offline
storage is being changed.

The state field is there to tell the HSM that an operation was in
progress when the system crashed. As part of recovery, the HSM needs
to find all the inodes that had DM operations in progress and either
complete them or revert them to bring everything back to a
consistent state. The SGI HSMs used the bulkstat interfaces to scan
the fs and find inodes that had a non-zero DM state field. This is
one of the reasons that having bulkstat scale out to scanning
millions of inodes a second ends up being important - coherency
checking between the on-disk filesystem state and the userspace
offline data tracking databases is a very important admin
operation...

The XFS dmapi event and state mask control APIs are now deprecated.
The XFS_IOC_FSSETDM ioctl could read and write the values, and the
XFS V1 bulkstat ioctl could read them. There were also flags for
things like extent mapping ioctls (FIEMAP equivalent) that ensured
looking at the extent map didn't trigger DMAPI events and data
recall.

I guess what I'm trying to say is that there's a lot more to an
efficient implementation of a HSM event notification mechanism than
just implementing a couple of blocking upcalls. If we want something
that will replace even simple DMAPI-based HSM use cases, we really
need to think through how to support all the operations that a
recall operation might be needed for and hence have to block. And we
really should think about how to efficiently filter out unnecessary
events so that we don't drown the HSM in IO events it just doesn't
need to know about....

> Another thing that might be good to do is provide an administrative
> interface to iterate and abort pending fanotify permission/pre-content
> events.

That was generally something the DMAPI event consumer provided.

> You must have noticed the overlap between my old persistent
> change tracking journal and this design. The referenced branch
> is from that old POC.
> 
> I do believe that the use cases somewhat overlap and that the
> same building blocks could be used to implement a persistent
> change journal in userspace as you suggested back then.

That's a very different use case and set of requirements to a HSM.

A HSM tracks much, much larger amounts of data than a persistent
change journal. We had [C]XFS-DMAPI based HSMs running SLES in
production that tracked half a billion inodes and > 10PB of data 15
years ago. These days I'd expect "exabyte" to be the unit of
storage that large HSMs are storing.

> Thoughts?

I think that if the goal is to support HSMs with fanotify, we first
need to think about how we efficiently support all the functionality
HSMs require rather than just focus on a blocking fanotify read
operation. We don't need to implement everything, but at least
having a plan for things like handling the event filtering
requirements without the HSM having to walk the entire filesystem
and inject per-inode event filters after every mount would be a real
good idea....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-09-21 23:27 ` Dave Chinner
@ 2022-09-22  4:35   ` Amir Goldstein
  2022-09-23  7:57     ` Dave Chinner
  0 siblings, 1 reply; 43+ messages in thread
From: Amir Goldstein @ 2022-09-22  4:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, linux-fsdevel, Plaster, Robert

On Thu, Sep 22, 2022 at 2:27 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sun, Sep 11, 2022 at 09:12:06PM +0300, Amir Goldstein wrote:
> > Hi Jan,
> >
> > I wanted to consult with you about preliminary design thoughts
> > for implementing a hierarchical storage manager (HSM)
> > with fanotify.
> >
> > I have been in contact with some developers in the past
> > who were interested in using fanotify to implement HSM
> > (to replace old DMAPI implementation).
> >
> > Basically, FAN_OPEN_PERM + FAN_MARK_FILESYSTEM
> > should be enough to implement a basic HSM, but it is not
> > sufficient for implementing more advanced HSM features.
>
> Ah, I wondered where the people with that DMAPI application went all
> those years ago after I told them they should look into using
> fanotify to replace the dependency they had on the DMAPI patched
> XFS that SGI maintained for years for SLES kernels...
>

Indeed. Robert has said that the fanotify HSM code will be uploaded
to GitHub in the near future:

https://deepspacestorage.com/resources/#downloads

> > Some of the HSM feature that I would like are:
> > - blocking hook before access to file range and fill that range
> > - blocking hook before lookup of child and optionally create child
>
> Ok, so these are to replace the DMAPI hooks that provided a blocking
> userspace upcall to the HSM to allow it to fetch data from offline
> tiers that wasn't currently in the filesystem itself. i.e. the inode
> currently has a hole over that range of data, but before the read
> can proceed the HSM needs to retrieve the data from the remote
> storage and write it into the local filesystem.
>
> I think that you've missed a bunch of blocking notifications that
> are needed, though. e.g. truncate needs to block while the HSM
> records the file ranges it is storing offline are now freed.
> fallocate() needs to block while it waits for the HSM to tell it the
> ranges of the file that actually contain data and so should need to
> be taken into account. (e.g. ZERO_RANGE needs to wait for offline
> data to be invalidated, COLLAPSE_RANGE needs offline data to be
> recalled just like a read() operation, etc).
>
> IOWs, any operation that manipulates the extent map or the data in
> the file needs a blocking upcall to the HSM so that it can restore
> and invalidate the offline data across the range of the operation
> that is about to be performed....

The FAN_PRE_MODIFY event mentioned below is intended to
take care of those cases.
It is meant to be called from
security_file_permission(MAY_WRITE) => fsnotify_perm(),
which covers all the cases that you mentioned.

>
> > My thoughts on the UAPI were:
> > - Allow new combination of FAN_CLASS_PRE_CONTENT
> >   and FAN_REPORT_FID/DFID_NAME
> > - This combination does not allow any of the existing events
> >   in mask
> > - It Allows only new events such as FAN_PRE_ACCESS
> >   FAN_PRE_MODIFY and FAN_PRE_LOOKUP
> > - FAN_PRE_ACCESS and FAN_PRE_MODIFY can have
> >   optional file range info
> > - All the FAN_PRE_ events are called outside vfs locks and
> >   specifically before sb_writers lock as in my fsnotify_pre_modify [1]
> >   POC
> >
> > That last part is important because the HSM daemon will
> > need to make modifications to the accessed file/directory
> > before allowing the operation to proceed.
>
> Yes, and that was the biggest problem with DMAPI - the locking
> involved. DMAPI operations have to block without holding any locks
> that the IO path, truncate, fallocate, etc might need, but once they
> are unblocked they need to regain those locks to allow the operation
> to proceed. This was by far the ugliest part of the DMAPI patches,
> and ultimately, the reason why it was never merged.
>

Part of the problem is that the DMAPI hooks were inside the fs code.
My intention is for the fanotify blocking hooks to live in the VFS,
before taking any locks, as much as possible.

I already have a POC project that added those pre-modify hooks
for the change tracking journal.

For most of the syscalls, security_file_permission(MAY_WRITE) is
called before taking vfs locks (sb_writers in particular), except for
dedupe and clone and that one is my fault - 031a072a0b8a
("vfs: call vfs_clone_file_range() under freeze protection"), so
I'll need to deal with it.

> > Naturally that opens the possibility for new userspace
> > deadlocks. Nothing that is not already possible with permission
> > event, but maybe deadlocks that are more inviting to trip over.
> >
> > I am not sure if we need to do anything about this, but we
> > could make it easier to ignore events from the HSM daemon
> > itself if we want to, to make the userspace implementation easier.
>
> XFS used "invisible IO" as the mechanism for avoiding sending DMAPI
> events for operations that we initiated by the HSM to move data into
> and out of the filesystem.
>
> No doubt you've heard us talk about invisible IO in the past -
> O_NOCMTIME is what that invisible IO has eventually turned into in a
> modern Linux kernel. We still use that for invisible IO - xfs_fsr
> uses it for moving data around during online defragmentation. The
> entire purpose of invisible IO was to provide a path for HSMs and
> userspace bulk data movers (e.g. HSM aware backup tools like
> xfsdump) to do IO without generating unnecessary or recursive DMAPI
> events....
>
> IOWs, if we want a DMAPI replacement, we will need to formalise a
> method of performing syscall based operations that will not trigger
> HSM notification events.
>

This concept already exists in fanotify.
This is what FMODE_NONOTIFY is for.
The event->fd handed out with a FAN_ACCESS_PERM event can be used to
read the file without triggering a recursive event.
This is how anti-malware products scan files.
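
For illustration, here is a minimal, untested sketch of that existing
pattern (FAN_CLASS_CONTENT + FAN_ACCESS_PERM on a filesystem mark,
which needs CAP_SYS_ADMIN); the kernel opens event->fd with
FMODE_NONOTIFY, so reading through it does not generate further
events. The mount path is made up:

#include <fcntl.h>
#include <sys/fanotify.h>
#include <unistd.h>

int main(void)
{
	char buf[8192], data[4096];
	/* Content-class group: permission events block until we respond. */
	int fan_fd = fanotify_init(FAN_CLASS_CONTENT, O_RDONLY | O_LARGEFILE);

	fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
		      FAN_ACCESS_PERM, AT_FDCWD, "/mnt/cache");

	for (;;) {
		ssize_t len = read(fan_fd, buf, sizeof(buf));
		struct fanotify_event_metadata *md = (void *)buf;

		for (; FAN_EVENT_OK(md, len); md = FAN_EVENT_NEXT(md, len)) {
			struct fanotify_response resp = {
				.fd = md->fd, .response = FAN_ALLOW,
			};

			/* md->fd carries FMODE_NONOTIFY, so this read does
			 * not recurse into another event; a real HSM would
			 * fill in the missing content here instead. */
			pread(md->fd, data, sizeof(data), 0);

			write(fan_fd, &resp, sizeof(resp));
			close(md->fd);
		}
	}
	return 0;
}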

I have extended that concept in my POC patch to avoid a recursive
FAN_LOOKUP_PERM event when calling an Xat() syscall with
a dirfd that was opened with FMODE_NONOTIFY:

https://github.com/amir73il/linux/commits/fanotify_pre_content

> The other thing that XFS had for DMAPI was persistent storage in the
> inode of the event mask that inode should report events for. See
> the di_dmevmask and di_dmstate fields defined in the on-disk inode
> format here:
>
> https://git.kernel.org/pub/scm/fs/xfs/xfs-documentation.git/tree/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc
>
> There's no detail for them, but the event mask indicated what DMAPI
> events the inode should issue notifications for, and the state field
> held information about DMAPI operations in progress.
>
> The event field is the important one here - if the event field was
> empty, access to the inode never generated DMAPI events. When the
> HSM moved data offline, the "READ" event mask bit was set by the HSM
> and that triggered DMAPI events for any operation that needed to
> read data or manipulate the extent map. When the data was brought
> entirely back online, the event masks count be cleared.
>

The HSM can and should manage this persistent bit, but it does not
have to be done using a special ioctl.
We already use xattrs and/or fileattr flags in our HSM implementation.

I've mentioned this in my reply to Jan.
The way I intend to address this "persistent READ hook" on the
fanotify side is by attaching an HSM-specific BPF program to
a fanotify filesystem mark.

We could standardize a fileattr flag, just as NODUMP was
standardized for the use of backup applications (e.g. NODATA), but it
is not a prerequisite, just a standard way for the HSM to set the
persistent READ hook bit.

An HSM product could be configured to repurpose NODUMP
for that matter, or to check for the absence of READ bits in st_mode or in an xattr.
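
For example, a trivial, untested way to keep that persistent bit with a
plain xattr; the xattr name is made up and purely an HSM implementation
choice:

#include <stdint.h>
#include <sys/xattr.h>

/* Hypothetical marker name - an HSM choice, not defined by the kernel. */
#define HSM_OFFLINE_XATTR "user.hsm.offline"

/* Persistently record "content is offline, read access needs a recall". */
int hsm_set_offline(const char *path)
{
	uint8_t val = 1;

	return setxattr(path, HSM_OFFLINE_XATTR, &val, sizeof(val), 0);
}

/* The HSM daemon (or its BPF filter) would check this bit to decide
 * whether a pre-content event actually requires a recall. */
int hsm_is_offline(const char *path)
{
	uint8_t val = 0;

	return getxattr(path, HSM_OFFLINE_XATTR, &val, sizeof(val)) > 0 && val;
}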

> However, DMAPI also supports dual state operation, where the
> data in the local filesystem is also duplicated in the offline
> storage (e.g. immediately after a recall operation). This state can
> persist until data or layout is changed in the local filesystem,
> and so there's a "WRITE" event mask as well that allows the
> filesystem to inform the HSM that data it may have in offline
> storage is being changed.
>
> The state field is there to tell the HSM that an operation was in
> progress when the system crashed. As part of recovery, the HSM needs
> to find all the inodes that had DM operations in progress and either
> complete them or revert them to bring everything back to a
> consistent state. THe SGI HSMs used the bulkstat interfaces to scan
> the fs and find inodes that had a non-zero DM state field. This is
> one of the reasons that having bulkstat scale out to scanning
> millions of inodes a second ends up being important - coherency
> checking between the ondisk filesystem state and the userspace
> offline data tracking databases is a very important admin
> operation..
>

Normally, the HSM will be listening on a filesystem mark for async
FAN_MODIFY and FAN_CLOSE_WRITE events.

To cover the case of a crash and missing fanotify events, we use
the persistent change tracking journal.
My current prototype is in the overlayfs driver, using pre-modify
fsnotify hooks, as we discussed back at LSFMM 2018:

https://github.com/amir73il/overlayfs/wiki/Overlayfs-watch

The idea is to export those pre-modify hooks via fanotify
and move this implementation from the overlayfs driver to
userspace HSM daemon.

Note that the name "Change Tracking Journal" may be confusing -
my implementation does not store a time sequence of events like
the NTFS Change Journal - it only stores a map of file handles of
directories containing new/changed/deleted files.
Iterating this "Changed dirs map" is way faster than iterating
bulkstat of all inodes and looking for the WRITE bit.

The responsibility for maintaining per-file "dirty" state is on the HSM
and it can be done using the change tracking journal and an
external database. Filesystem-provided features such as ctime
and iversion can be used to optimize the management of "dirty"
state, but they are not a prerequisite, and most of the time the
change journal is sufficient to be able to scale up, because it
can give you the answer to the question:
"In which of the multi-million dirs do I need to look for changes
 to be synced to secondary storage?".
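
As a rough, untested sketch of what the "changed dirs map" means in
practice, assuming the listener was set up with FAN_REPORT_DFID_NAME;
the append-only journal format here is only illustrative and
deduplication into the real map is left to the HSM database:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/fanotify.h>
#include <unistd.h>

/* Append the fsid + parent directory file handle reported with an event. */
void record_changed_dir(int journal_fd,
			const struct fanotify_event_metadata *md)
{
	const struct fanotify_event_info_fid *fid =
		(const struct fanotify_event_info_fid *)(md + 1);
	const struct file_handle *fh = (const struct file_handle *)fid->handle;

	if (fid->hdr.info_type != FAN_EVENT_INFO_TYPE_DFID_NAME)
		return;

	/* fsid + handle are enough to open_by_handle_at() the parent dir
	 * later and rescan it for new/changed/deleted children. */
	write(journal_fd, &fid->fsid, sizeof(fid->fsid));
	write(journal_fd, fh, sizeof(*fh) + fh->handle_bytes);
}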

> The XFS dmapi event and state mask control APIs are now deprecated.
> The XFS_IOC_FSSETDM ioctl could read and write the values, and the
> the XFS V1 bulkstat ioctl could read them. There were also flags for
> things like extent mapping ioctls (FIEMAP equivalent) that ensured
> looking at the extent map didn't trigger DMAPI events and data
> recall.
>
> I guess what I'm trying to say is that there's a lot more to an
> efficient implementation of a HSM event notification mechanism than
> just implementing a couple of blocking upcalls. IF we want something
> that will replace even simple DMAPI-based HSM use cases, we really
> need to think through how to support all the operations that a
> recall operation might needed for and hence have to block. ANd we
> really should think about how to efficiently filter out unnecessary
> events so that we don't drown the HSM in IO events it just doesn't
> need to know about....
>

Thinking about efficient HSM implementations and testing prototypes is
what I have been doing for the past 6 years at CTERA.

My thoughts and design for fanotify HSM are backed by several
successful prototypes that have been deployed in large-scale
customer environments where both high performance and reliable
backups are hard requirements.

> > Another thing that might be good to do is provide an administrative
> > interface to iterate and abort pending fanotify permission/pre-content
> > events.
>
> That was generally something the DMAPI event consumer provided.
>
> > You must have noticed the overlap between my old persistent
> > change tracking journal and this design. The referenced branch
> > is from that old POC.
> >
> > I do believe that the use cases somewhat overlap and that the
> > same building blocks could be used to implement a persistent
> > change journal in userspace as you suggested back then.
>
> That's a very different use case and set of requirements to a HSM.
>
> A HSM tracks much, much larger amounts of data than a persistent
> change journal. We had [C]XFS-DMAPI based HSMs running SLES in
> production that tracked half a billion inodes and > 10PB of data 15
> years ago. These days I'd expect "exabyte" to be the unit of
> storage that large HSMs are storing.
>
> > Thoughts?
>
> I think that if the goal is to support HSMs with fanotify, we first
> need to think about how we efficiently support all the functionality
> HSMs require rather than just focus on a blocking fanotify read
> operation. We don't need to implement everything, but at least
> having a plan for things like handling the event filtering
> requirements without the HSM having to walk the entire filesystem
> and inject per-inode event filters after every mount would be a real
> good idea....
>

All of the above has been considered and I have mentioned
the proposed solutions in this thread.

I may post some partial RFC patches to hash out some
implementation details along the way (e.g. for FAN_LOOKUP_PERM),
but when I send out the final proposal it will include all the fanotify
extensions required to implement a fully functional and performant HSM.

Given the number and length of the Q&As in this thread, I am probably
going to summarize this discussion in a wiki to send along with the
proposal for the fanotify HSM API.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-09-20 18:19           ` Amir Goldstein
@ 2022-09-22 10:48             ` Jan Kara
  2022-09-22 13:03               ` Amir Goldstein
  2022-10-28 12:50               ` Amir Goldstein
  0 siblings, 2 replies; 43+ messages in thread
From: Jan Kara @ 2022-09-22 10:48 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Miklos Szeredi, Plaster, Robert, David Howells,
	linux-fsdevel, Fufu Fang

On Tue 20-09-22 21:19:25, Amir Goldstein wrote:
> On Wed, Sep 14, 2022 at 2:52 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > > > > > So I'd prefer to avoid the major API
> > > > > > extension unless there are serious users out there - perhaps we will even
> > > > > > need to develop the kernel API in cooperation with the userspace part to
> > > > > > verify the result is actually usable and useful.
> > > >
> > > > Yap. It should be trivial to implement a "mirror" HSM backend.
> > > > For example, the libprojfs [5] projects implements a MirrorProvider
> > > > backend for the Microsoft ProjFS [6] HSM API.
> > >
> > > Well, validating that things work using some simple backend is one thing
> > > but we are probably also interested in whether the result is practical to
> > > use - i.e., whether the performance meets the needs, whether the API is not
> > > cumbersome for what HSM solutions need to do, whether the more advanced
> > > features like range-support are useful the way they are implemented etc.
> > > We can verify some of these things with simple mirror HSM backend but I'm
> > > afraid some of the problems may become apparent only once someone actually
> > > uses the result in practice and for that we need a userspace counterpart
> > > that does actually something useful so that people have motivation to use
> > > it :).
> >
> 
> Hi Jan,
> 
> I wanted to give an update on the POC that I am working on.
> I decided to find a FUSE HSM and show how it may be converted
> to use fanotify HSM hooks.
> 
> HTTPDirFS is a read-only FUSE filesystem that lazily populates a local
> cache from a remote http on first access to every directory and file range.
> 
> Normally, it would be run like this:
> ./httpdirfs --cache-location /vdf/cache https://cdn.kernel.org/pub/ /mnt/pub/
> 
> Content is accessed via FUSE mount as /mnt/pub/ and FUSE implements
> passthrough calls to the local cache dir if cache is already populated.
> 
> After my conversion patches [1], this download-only HSM can be run like
> this without mounting FUSE:
> 
> sudo ./httpdirfs --fanotify --cache-location /vdf/cache
> https://cdn.kernel.org/pub/ -
> 
> [1] https://github.com/amir73il/httpdirfs/commits/fanotify_pre_content
> 
> Browsing the cache directory at /vdf/cache lazily populates the local cache
> using FAN_ACCESS_PERM readdir hooks and lazily downloads file content
> using FAN_ACCESS_PERM read hooks.
> 
> Up to this point, the implementation did not require any kernel changes.
> However, this type of command does not populate the path components,
> because lookup does not generate FAN_ACCESS_PERM event:
> 
> stat /vdf/cache/data/linux/kernel/firmware/linux-firmware-20220815.tar.gz
> 
> To bridge that functionality gap, I've implemented the FAN_LOOKUP_PERM
> event [2] and used it to lazyly populate directories in the path ancestry.
> For now, I stuck with the XXX_PERM convention and did not require
> FAN_CLASS_PRE_CONTENT, although we probably should.
> 
> [2] https://github.com/amir73il/linux/commits/fanotify_pre_content
> 
> Streaming read of large files works as well, but only for sequential read
> patterns. Unlike the FUSE read calls, the FAN_ACCESS_PERM events
> do not (yet) carry range info, so my naive implementation downloads
> one extra data chunk on each FAN_ACCESS_PERM until the cache file is full.
> 
> This makes it possible to run commands like:
> 
> tar tvfz /vdf/cache/data/linux/kernel/firmware/linux-firmware-20220815.tar.gz
> | less
> 
> without having to wait for the entire 400MB file to download before
> seeing the first page.
> 
> This streaming feature is extremely important for modern HSMs
> that are often used to archive large media files in the cloud.

Thanks for the update, Amir! I've glanced through the series and so far it looks
pretty simple and I'd have only some style / readability nits (but let's
resolve those once we have something more complete).

When thinking about HSM (and while following your discussion with Dave) I
wondered about one thing: when the notifications happen before we take
locks, then we are in principle prone to time-of-check-time-of-use races,
aren't we? How are these resolved?

For example something like:
We have file with size 16k.
Reader:				Writer
  read 8k at offset 12k
    -> notification sent
    - HSM makes sure 12-16k is here and 16-20k is beyond eof so nothing to do

				expand file to 20k
  - now the file contents must not get moved out until the reader is
    done in order not to break it

> For the next steps of POC, I could do:
> - Report FAN_ACCESS_PERM range info to implement random read
>   patterns (e.g. unzip -l)
> - Introduce FAN_MODIFY_PERM, so file content could be downloaded
>   before modifying a read-write HSM cache
> - Demo conversion of a read-write FUSE HSM implementation
>   (e.g. https://github.com/volga629/davfs2)
> - Demo HSM with filesystem mark [*] and a hardcoded test filter
> 
> [*] Note that unlike the case with recursive inotify, this POC HSM
> implementation is not racy, because of the lookup permission events.
> A filesystem mark is still needed to avoid pinning all the unpopulated
> cache tree leaf entries to inode cache, so that this HSM could work on
> a very large scale tree, the same as my original use case for implementing
> filesystem mark.

Sounds good! Just with your concern about pinning - can't you use evictable
marks added on lookup for files / dirs you want to track? Maybe it isn't
great design for other reasons but it would save you some event
filtering...
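
Something like this (untested) sketch, using FAN_MARK_EVICTABLE from
Linux 5.19, is what I have in mind; the helper name is made up:

#include <stddef.h>
#include <sys/fanotify.h>

/* Add an evictable inode mark on a directory we just handled a lookup
 * for. The mark does not pin the inode, so it goes away together with
 * the inode under memory pressure instead of pinning the whole tree. */
int track_dir_on_lookup(int fan_fd, int dirfd)
{
	return fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_EVICTABLE,
			     FAN_ACCESS_PERM | FAN_ONDIR, dirfd, NULL);
}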

> If what you are looking for is an explanation why fanotify HSM would be better
> than a FUSE HSM implementation then there are several reasons.
> Performance is at the top of the list. There is this famous USENIX paper [3]
> about FUSE passthrough performance.
> It is a bit outdated, but many parts are still relevant - you can ask
> the Android
> developers why they decided to work on FUSE-BPF...
> 
> [3] https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf
> 
> For me, performance is one of the main concerns, but not the only one,
> so I am not entirely convinced that a full FUSE-BPF implementation would
> solve all my problems.
> 
> When scaling to many millions of passthrough inodes, resource usage starts
> becoming a limitation of a FUSE passthrough implementation and memory
> reclaim of native fs works a lot better than memory reclaim over FUSE over
> another native fs.
> 
> When the workload works on the native filesystem, it is also possible to
> use native fs features (e.g. XFS ioctls).

OK, understood. Out of curiosity you've mentioned you'd looked into
implementing HSM in overlayfs. What are the issues there? I assume
performance is very close to native one so that is likely not an issue and
resource usage you mention above likely is not that bad either. So I guess
it is that you don't want to invent hooks for userspace for moving (parts
of) files between offline storage and the local cache?

> Questions:
> - What do you think about the direction this POC has taken so far?
> - Is there anything specific that you would like to see in the POC
>   to be convinced that this API will be useful?

I think your POC is taking a good direction and your discussion with Dave
had made me more confident that this is all workable :). I liked your idea
of the wiki (or whatever form of documentation) that summarizes what we've
discussed in this thread. That would be actually pretty nice for future
reference.

The remaining concern I have is that we should demonstrate the solution is
able to scale to millions of inodes (and likely more) because AFAIU those
are the sizes current HSM solutions are interested in. I guess this is kind
of covered in your last step of POCs though.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-09-22 10:48             ` Jan Kara
@ 2022-09-22 13:03               ` Amir Goldstein
  2022-09-26 15:27                 ` Jan Kara
  2022-10-28 12:50               ` Amir Goldstein
  1 sibling, 1 reply; 43+ messages in thread
From: Amir Goldstein @ 2022-09-22 13:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: Miklos Szeredi, Plaster, Robert, David Howells, linux-fsdevel, Fufu Fang

On Thu, Sep 22, 2022 at 1:48 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 20-09-22 21:19:25, Amir Goldstein wrote:
> > On Wed, Sep 14, 2022 at 2:52 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > > > > > So I'd prefer to avoid the major API
> > > > > > > extension unless there are serious users out there - perhaps we will even
> > > > > > > need to develop the kernel API in cooperation with the userspace part to
> > > > > > > verify the result is actually usable and useful.
> > > > >
> > > > > Yap. It should be trivial to implement a "mirror" HSM backend.
> > > > > For example, the libprojfs [5] projects implements a MirrorProvider
> > > > > backend for the Microsoft ProjFS [6] HSM API.
> > > >
> > > > Well, validating that things work using some simple backend is one thing
> > > > but we are probably also interested in whether the result is practical to
> > > > use - i.e., whether the performance meets the needs, whether the API is not
> > > > cumbersome for what HSM solutions need to do, whether the more advanced
> > > > features like range-support are useful the way they are implemented etc.
> > > > We can verify some of these things with simple mirror HSM backend but I'm
> > > > afraid some of the problems may become apparent only once someone actually
> > > > uses the result in practice and for that we need a userspace counterpart
> > > > that does actually something useful so that people have motivation to use
> > > > it :).
> > >
> >
> > Hi Jan,
> >
> > I wanted to give an update on the POC that I am working on.
> > I decided to find a FUSE HSM and show how it may be converted
> > to use fanotify HSM hooks.
> >
> > HTTPDirFS is a read-only FUSE filesystem that lazyly populates a local
> > cache from a remote http on first access to every directory and file range.
> >
> > Normally, it would be run like this:
> > ./httpdirfs --cache-location /vdf/cache https://cdn.kernel.org/pub/ /mnt/pub/
> >
> > Content is accessed via FUSE mount as /mnt/pub/ and FUSE implements
> > passthrough calls to the local cache dir if cache is already populated.
> >
> > After my conversion patches [1], this download-only HSM can be run like
> > this without mounting FUSE:
> >
> > sudo ./httpdirfs --fanotify --cache-location /vdf/cache
> > https://cdn.kernel.org/pub/ -
> >
> > [1] https://github.com/amir73il/httpdirfs/commits/fanotify_pre_content
> >
> > Browsing the cache directory at /vdf/cache, lazyly populates the local cache
> > using FAN_ACCESS_PERM readdir hooks and lazyly downloads files content
> > using FAN_ACCESS_PERM read hooks.
> >
> > Up to this point, the implementation did not require any kernel changes.
> > However, this type of command does not populate the path components,
> > because lookup does not generate FAN_ACCESS_PERM event:
> >
> > stat /vdf/cache/data/linux/kernel/firmware/linux-firmware-20220815.tar.gz
> >
> > To bridge that functionality gap, I've implemented the FAN_LOOKUP_PERM
> > event [2] and used it to lazyly populate directories in the path ancestry.
> > For now, I stuck with the XXX_PERM convention and did not require
> > FAN_CLASS_PRE_CONTENT, although we probably should.
> >
> > [2] https://github.com/amir73il/linux/commits/fanotify_pre_content
> >
> > Streaming read of large files works as well, but only for sequential read
> > patterns. Unlike the FUSE read calls, the FAN_ACCESS_PERM events
> > do not (yet) carry range info, so my naive implementation downloads
> > one extra data chunk on each FAN_ACCESS_PERM until the cache file is full.
> >
> > This makes it possible to run commands like:
> >
> > tar tvfz /vdf/cache/data/linux/kernel/firmware/linux-firmware-20220815.tar.gz
> > | less
> >
> > without having to wait for the entire 400MB file to download before
> > seeing the first page.
> >
> > This streaming feature is extremely important for modern HSMs
> > that are often used to archive large media files in the cloud.
>
> Thanks for update Amir! I've glanced through the series and so far it looks
> pretty simple and I'd have only some style / readability nits (but let's
> resolve those once we have something more complete).
>
> When thinking about HSM (and while following your discussion with Dave) I
> wondered about one thing: When the notifications happen before we take
> locks, then we are in principle prone to time-to-check-time-to-use races,
> aren't we? How are these resolved?
>
> For example something like:
> We have file with size 16k.
> Reader:                         Writer
>   read 8k at offset 12k
>     -> notification sent
>     - HSM makes sure 12-16k is here and 16-20k is beyond eof so nothing to do
>
>                                 expand file to 20k
>   - now the file contents must not get moved out until the reader is
>     done in order not to break it
>

Good question.
The way I was considering resolving this is that, to evict file
content, the HSM would:

1. Try to take a write lease (F_SETLEASE) on the file
    to make sure no other users are accessing the file
2. Set the mark or persistent NODATA bit on the file
3. Drop the write lease
4. Evict file content

You can think of it as upgrading from a breakable write lease
to an unbreakable write lease.

This way, users cannot access or modify file content without
either triggering FAN_ACCESS_PERM/FAN_MODIFY_PERM
or preventing the HSM from evicting file content until close().

And this fix in linux-next:
https://lore.kernel.org/linux-fsdevel/20220816145317.710368-1-amir73il@gmail.com/
makes sure that those users cannot open a file without either
triggering FAN_OPEN_PERM or preventing the HSM from evicting file
content until close().

The HSM's signal handler for a lease break would run rarely and only needs to
drop the lease. If the lease is dropped before the mark was successfully set,
the content eviction is aborted.

I hope I got this right - this is a design that I have not POCed yet,
but it is on my TODO list.
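
Roughly, an untested userspace sketch of steps 1-3 above;
hsm_set_offline_mark() is a stand-in for step 2, and the usual
F_SETLEASE caveats apply (no other open file descriptions, file owner
or CAP_LEASE):

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>

/* Hypothetical step 2: set the fanotify mark / persistent NODATA bit. */
int hsm_set_offline_mark(int fd);
/* Set by the SIGIO/lease-break handler if another open sneaks in. */
extern volatile sig_atomic_t hsm_lease_broken;

int hsm_prepare_evict(int fd)
{
	/* Step 1: write lease; fails if anyone else has the file open. */
	if (fcntl(fd, F_SETLEASE, F_WRLCK) < 0)
		return -1;

	/* Step 2: mark the file so that any access after the lease is
	 * dropped triggers a pre-content event and recalls the data. */
	if (hsm_set_offline_mark(fd) < 0 || hsm_lease_broken) {
		fcntl(fd, F_SETLEASE, F_UNLCK);
		return -1;	/* abort the eviction, retry later */
	}

	/* Step 3: drop the lease; step 4 (punching out the content) is now
	 * safe because new accesses hit the mark first. */
	fcntl(fd, F_SETLEASE, F_UNLCK);
	return 0;
}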

BTW, evicting directory content is not something that I have on my radar
at the moment. Not sure how many HSMs do that anyway.
FWIW, it was never part of DMAPI.

> > For the next steps of POC, I could do:
> > - Report FAN_ACCESS_PERM range info to implement random read
> >   patterns (e.g. unzip -l)
> > - Introduce FAN_MODIFY_PERM, so file content could be downloaded
> >   before modifying a read-write HSM cache
> > - Demo conversion of a read-write FUSE HSM implementation
> >   (e.g. https://github.com/volga629/davfs2)
> > - Demo HSM with filesystem mark [*] and a hardcoded test filter
> >
> > [*] Note that unlike the case with recursive inotify, this POC HSM
> > implementation is not racy, because of the lookup permission events.
> > A filesystem mark is still needed to avoid pinning all the unpopulated
> > cache tree leaf entries to inode cache, so that this HSM could work on
> > a very large scale tree, the same as my original use case for implementing
> > filesystem mark.
>
> Sounds good! Just with your concern about pinning - can't you use evictable
> marks added on lookup for files / dirs you want to track? Maybe it isn't
> great design for other reasons but it would save you some event
> filtering...
>

With the current POC, there is no trigger to re-establish the evicted mark,
because the parent is already populated and has no mark.

A hook on instantiation of an inode into the inode cache could fill that gap.
It could still be useful to filter FAN_INSTANTIATE_PERM events in the
kernel, but it is not a must, because instantiation is rarer than (say) lookup,
and the fast lookup path (RCU walk) on populated trees suffers almost
no overhead when the filesystem is watched.

Please think about this and let me know if you think that this is a direction
worth pursuing, now, or as a later optimization.

> > If what you are looking for is an explanation why fanotify HSM would be better
> > than a FUSE HSM implementation then there are several reasons.
> > Performance is at the top of the list. There is this famous USENIX paper [3]
> > about FUSE passthrough performance.
> > It is a bit outdated, but many parts are still relevant - you can ask
> > the Android
> > developers why they decided to work on FUSE-BFP...
> >
> > [3] https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf
> >
> > For me, performance is one of the main concerns, but not the only one,
> > so I am not entirely convinced that a full FUSE-BFP implementation would
> > solve all my problems.
> >
> > When scaling to many millions of passthrough inodes, resource usage start
> > becoming a limitation of a FUSE passthrough implementation and memory
> > reclaim of native fs works a lot better than memory reclaim over FUSE over
> > another native fs.
> >
> > When the workload works on the native filesystem, it is also possible to
> > use native fs features (e.g. XFS ioctls).
>
> OK, understood. Out of curiosity you've mentioned you'd looked into
> implementing HSM in overlayfs. What are the issues there? I assume
> performance is very close to native one so that is likely not an issue and
> resource usage you mention above likely is not that bad either. So I guess
> it is that you don't want to invent hooks for userspace for moving (parts
> of) files between offline storage and the local cache?
>

In a nutshell, when realizing that overlayfs needs userspace hooks
to cater to HSM, it becomes quite pointless to use a stacked fs design.

Performance is not a problem with overlayfs, but like with FUSE,
all the inodes/dentries in the system are doubled, memory reclaim
of a layered fs becomes an awkward dance that messes with the
special logic of the XFS shrinkers, and on top of all this, overlayfs does
not proxy all the XFS ioctls either.

The fsnotify hooks are a much better design, given that the
likely() case is to do nothing and incur the least overhead, and
the unlikely() case of a user hook is rare.

> > Questions:
> > - What do you think about the direction this POC has taken so far?
> > - Is there anything specific that you would like to see in the POC
> >   to be convinced that this API will be useful?
>
> I think your POC is taking a good direction and your discussion with Dave
> had made me more confident that this is all workable :). I liked your idea
> of the wiki (or whatever form of documentation) that summarizes what we've
> discussed in this thread. That would be actually pretty nice for future
> reference.

Yeah, I need that wiki to organize my thoughts as well ;)

>
> The remaining concern I have is that we should demonstrate the solution is
> able to scale to millions of inodes (and likely more) because AFAIU that
> are the sizes current HSM solutions are interested in. I guess this is kind
> of covered in your last step of POCs though.
>

Well, at $WORK we have performance test setups for those workloads,
so part of my plan is to convert the in-house FUSE HSM
to fanotify and make sure that all those tests do not regress.
But that is not code, nor tests, that I can release; I can only report back
that the POC works and show the building blocks that I used on
some open source code base.

I plan to do the open source small scale POC first to show the
building blocks so you could imagine the end results and
then take the building blocks for a test drive in the real world.

I have my eye on davfs2 [1] as the code base for the read-write HSM
POC, but maybe I will find an S3 FUSE fs that could work too.
I am open to other suggestions.

[1] https://github.com/volga629/davfs2

When DeepSpace Storage releases their product on GitHub,
I will be happy to work with them on a POC with their code
base, and I bet they could arrange a large-scale test setup.
(hint hint).

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-09-22  4:35   ` Amir Goldstein
@ 2022-09-23  7:57     ` Dave Chinner
  2022-09-23 11:22       ` Amir Goldstein
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Chinner @ 2022-09-23  7:57 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Plaster, Robert

On Thu, Sep 22, 2022 at 07:35:39AM +0300, Amir Goldstein wrote:
> On Thu, Sep 22, 2022 at 2:27 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Sun, Sep 11, 2022 at 09:12:06PM +0300, Amir Goldstein wrote:
> > > Hi Jan,
> > >
> > > I wanted to consult with you about preliminary design thoughts
> > > for implementing a hierarchical storage manager (HSM)
> > > with fanotify.
> > >
> > > I have been in contact with some developers in the past
> > > who were interested in using fanotify to implement HSM
> > > (to replace old DMAPI implementation).
> > >
> > > Basically, FAN_OPEN_PERM + FAN_MARK_FILESYSTEM
> > > should be enough to implement a basic HSM, but it is not
> > > sufficient for implementing more advanced HSM features.
> >
> > Ah, I wondered where the people with that DMAPI application went all
> > those years ago after I told them they should look into using
> > fanotify to replace the dependency they had on the DMAPI patched
> > XFS that SGI maintained for years for SLES kernels...
> >
> 
> Indeed. Robert has told that the fanotify HSM code would be uploaded
> to github in the near future:
> 
> https://deepspacestorage.com/resources/#downloads
> 
> > > Some of the HSM feature that I would like are:
> > > - blocking hook before access to file range and fill that range
> > > - blocking hook before lookup of child and optionally create child
> >
> > Ok, so these are to replace the DMAPI hooks that provided a blocking
> > userspace upcall to the HSM to allow it fetch data from offline
> > teirs that wasn't currently in the filesystem itself. i.e. the inode
> > currently has a hole over that range of data, but before the read
> > can proceed the HSM needs to retreive the data from the remote
> > storage and write it into the local filesystem.
> >
> > I think that you've missed a bunch of blocking notifications that
> > are needed, though. e.g. truncate needs to block while the HSM
> > records the file ranges it is storing offline are now freed.
> > fallocate() needs to block while it waits for the HSM to tell it the
> > ranges of the file that actually contain data and so should need to
> > be taken into account. (e.g. ZERO_RANGE needs to wait for offline
> > data to be invalidated, COLLAPSE_RANGE needs offline data to be
> > recalled just like a read() operation, etc).
> >
> > IOWs, any operation that manipulates the extent map or the data in
> > the file needs a blocking upcall to the HSM so that it can restore
> > and invalidate the offline data across the range of the operation
> > that is about to be performed....
> 
> The event FAN_PRE_MODIFY mentioned below is destined to
> take care of those cases.
> It is destined to be called from
> security_file_permission(MAY_WRITE) => fsnotify_perm()
> which covers all the cases that you mentioned.

That doesn't provide IO ranges to userspace, so does that mean
write(1 byte) might trigger the recall of an entire TB sized file
from offline storage?

How do you support partial file migration with a scheme like this?

> > > My thoughts on the UAPI were:
> > > - Allow new combination of FAN_CLASS_PRE_CONTENT
> > >   and FAN_REPORT_FID/DFID_NAME
> > > - This combination does not allow any of the existing events
> > >   in mask
> > > - It Allows only new events such as FAN_PRE_ACCESS
> > >   FAN_PRE_MODIFY and FAN_PRE_LOOKUP
> > > - FAN_PRE_ACCESS and FAN_PRE_MODIFY can have
> > >   optional file range info
> > > - All the FAN_PRE_ events are called outside vfs locks and
> > >   specifically before sb_writers lock as in my fsnotify_pre_modify [1]
> > >   POC
> > >
> > > That last part is important because the HSM daemon will
> > > need to make modifications to the accessed file/directory
> > > before allowing the operation to proceed.
> >
> > Yes, and that was the biggest problem with DMAPI - the locking
> > involved. DMAPI operations have to block without holding any locks
> > that the IO path, truncate, fallocate, etc might need, but once they
> > are unblocked they need to regain those locks to allow the operation
> > to proceed. This was by far the ugliest part of the DMAPI patches,
> > and ultimately, the reason why it was never merged.
> 
> Part of the problem is that the DMAPI hooks were inside fs code.
> My intention for fanotify blocking hooks to be in vfs before taking any
> locks as much as possible.
>
> I already have a poc project that added those pre-modify hooks
> for the change tracking journal thing.
> 
> For most of the syscalls, security_file_permission(MAY_WRITE) is
> called before taking vfs locks (sb_writers in particular), except for
> dedupe and clone and that one is my fault - 031a072a0b8a
> ("vfs: call vfs_clone_file_range() under freeze protection"), so
> I'll need to deal with it.

This looks like a fairly large TOCTOU race condition.  i.e. what's to
prevent a "move offline" operation run by a userspace HSM agent
racing with an application read that has done its FAN_PRE_MODIFY
call but hasn't yet started/completed its physical read from the
filesystem? This sort of race can result in the data in the file
being migrated and punched out by the HSM before the read gets the
i_rwsem or page locks it needs to read the data into cache....

What am I missing here? How does this "no filesystem locks"
notification scheme avoid these sorts of spurious user data
corruption events from occurring?

> > > Naturally that opens the possibility for new userspace
> > > deadlocks. Nothing that is not already possible with permission
> > > event, but maybe deadlocks that are more inviting to trip over.
> > >
> > > I am not sure if we need to do anything about this, but we
> > > could make it easier to ignore events from the HSM daemon
> > > itself if we want to, to make the userspace implementation easier.
> >
> > XFS used "invisible IO" as the mechanism for avoiding sending DMAPI
> > events for operations that we initiated by the HSM to move data into
> > and out of the filesystem.
> >
> > No doubt you've heard us talk about invisible IO in the past -
> > O_NOCMTIME is what that invisible IO has eventually turned into in a
> > modern Linux kernel. We still use that for invisible IO - xfs_fsr
> > uses it for moving data around during online defragmentation. The
> > entire purpose of invisible IO was to provide a path for HSMs and
> > userspace bulk data movers (e.g. HSM aware backup tools like
> > xfsdump) to do IO without generating unnecessary or recursive DMAPI
> > events....
> >
> > IOWs, if we want a DMAPI replacement, we will need to formalise a
> > method of performing syscall based operations that will not trigger
> > HSM notification events.
> 
> This concept already exists in fanotify.
> This is what FMODE_NONOTIFY is for.
> The event->fd handed in FAN_ACCESS_PERM event can be use to
> read the file without triggering a recursive event.
> This is how Anti-Malware products scan files.

So all that you need is to enable invisible IO for the data
migration operations, then?

BTW, where does the writable fd that is passed with the userspace
event - which the HSM can use to do IO to/from the inode - come from? How
does fanotify guarantee that the recipient has the necessary
permissions to read/write to the file it represents?

> I have extended that concept in my POC patch to avoid recursive
> FAN_LOOKUP_PERM event when calling Xat() syscall with
> a dirfd with FMODE_NONOTIFY:
> 
> https://github.com/amir73il/linux/commits/fanotify_pre_content
> 
> > The other thing that XFS had for DMAPI was persistent storage in the
> > inode of the event mask that inode should report events for. See
> > the di_dmevmask and di_dmstate fields defined in the on-disk inode
> > format here:
> >
> > https://git.kernel.org/pub/scm/fs/xfs/xfs-documentation.git/tree/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc
> >
> > There's no detail for them, but the event mask indicated what DMAPI
> > events the inode should issue notifications for, and the state field
> > held information about DMAPI operations in progress.
> >
> > The event field is the important one here - if the event field was
> > empty, access to the inode never generated DMAPI events. When the
> > HSM moved data offline, the "READ" event mask bit was set by the HSM
> > and that triggered DMAPI events for any operation that needed to
> > read data or manipulate the extent map. When the data was brought
> > entirely back online, the event masks count be cleared.
> >
> 
> HSM can and should manage this persistent bit, but it does not
> have to be done using a special ioctl.
> We already use xattr and/or fileattr flags in our HSM implementation.
> 
> I've mentioned this in my reply to Jan.
> The way I intend to address this "persistent READ hook" in
> fanotify side is by attaching an HSM specific BFP program to
> an fanotify filesystem mark.

How is that fanotify mark and BPF hook set? What triggers it?
What makes it persistent? Why does it need a BPF program? What is
the program actually intended to do?

You're making big assumptions that people actually know what you are
trying to implement, and how and why. Perhaps you should write a
design overview that connects all the dots and explains why things
like custom BPF programs are necessary...

> We could standartize a fileattr flag just the same as NODUMP
> was for the use of backup applications (e.g. NODATA), but it
> is not a prerequite, just a standard way for HSM to set the
> persistent READ hook bit.

So you do actually need the filesystem to store information about
HSM events that need to be generated, but then all the events are
generated at the VFS without holding filesystem locks? That seems
like it could be racy, too.

> An HSM product could be configured to reappropriate NODUMP
> for that matter or check for no READ bits in st_mode or xattr.

You know better than to try to redefine the meaning of a well-known
attribute that has been exposed to userspace for 25+ years and is
widely used...

> > However, DMAPI also supports dual state operation, where the
> > data in the local filesystem is also duplicated in the offline
> > storage (e.g. immediately after a recall operation). This state can
> > persist until data or layout is changed in the local filesystem,
> > and so there's a "WRITE" event mask as well that allows the
> > filesystem to inform the HSM that data it may have in offline
> > storage is being changed.
> >
> > The state field is there to tell the HSM that an operation was in
> > progress when the system crashed. As part of recovery, the HSM needs
> > to find all the inodes that had DM operations in progress and either
> > complete them or revert them to bring everything back to a
> > consistent state. THe SGI HSMs used the bulkstat interfaces to scan
> > the fs and find inodes that had a non-zero DM state field. This is
> > one of the reasons that having bulkstat scale out to scanning
> > millions of inodes a second ends up being important - coherency
> > checking between the ondisk filesystem state and the userspace
> > offline data tracking databases is a very important admin
> > operation..
> >
> 
> Normally, HSM will be listening on a filesystem mark to async
> FAN_MODIFY and FAN_CLOSE_WRITE events.
> 
> To cover the case of crash and missing fanotify events, we use
> the persistent change tracking journal.
> My current prototype is in overlayfs driver using pre-modify
> fsnotify hooks, as we discussed back in LSFMM 2018:
> 
> https://github.com/amir73il/overlayfs/wiki/Overlayfs-watch

Yup, which led to the conclusion that the only way it could work
was "synchronous write of the change intent to the change journal
before the change is made".

> The idea is to export those pre-modify hooks via fanotify
> and move this implementation from the overlayfs driver to
> userspace HSM daemon.
> 
> Note that the name "Change Tracking Journal" may be confusing -
> My implementation does not store a time sequence of events like
> the NTFS Change Journal - it only stores a map of file handles of
> directories containing new/changed/deleted files.
>
> Iterating this "Changed dirs map" is way faster then itereating
> bulkstat of all inodes and looking for the WRITE bit.

Maybe so, but you haven't ever provided data to indicate what sort
of filesystem modification rate this change journal can sustain.
Just from my understanding of what it needs to do (sync updates), I
can't see how it could be implemented without affecting overall
runtime file and directory modification rates adversely.

At which point I ponder what is worse - always taking the runtime
overhead for the change journal operations to be propagated to the
HSM to record persistently and then replaying that journal after a
mount, or the HSM just needing to run a bulkstat scan in the
background after mount to do a coherency check and resolution scan?

> The responsibility of maintaining per file "dirty" state is on HSM
> and it can be done using the change tracking journal and an
> external database. Filesystem provided features such as ctime
> and iversion can be used to optimize the management of "dirty"
> state, but they are not a prerequisite and most of the time the
> change journal is sufficient to be able to scale up, because it
> can give you the answer to the question:
> "In which of the multi million dirs, do I need to look for changes
>  to be synced to secondary storage?".

Sure, it's not every file that gets recorded, but it still sounds
like this could be a significant percentage of the filesystem
contents that need scanning. And the HSM would still have to scan
them via directory structure operations, which typically cost
about 3-4x as much IO, CPU and wall time overhead per scanned inode
compared to bulkstat.

Please note that I'm not saying that either mechanism is better, nor
am I saying that bulkstat scanning is a perfect solution, just that
I've got zero information about how this change journal performs and
what impact it has on runtime operation. Bulkstat scans, OTOH, I
understand intimately and especially in the context of HSM
operations. i.e. I know that scanning a billion inodes only takes a
few minutes of wall time with modern SSDs and CPUs...

> > The XFS dmapi event and state mask control APIs are now deprecated.
> > The XFS_IOC_FSSETDM ioctl could read and write the values, and the
> > the XFS V1 bulkstat ioctl could read them. There were also flags for
> > things like extent mapping ioctls (FIEMAP equivalent) that ensured
> > looking at the extent map didn't trigger DMAPI events and data
> > recall.
> >
> > I guess what I'm trying to say is that there's a lot more to an
> > efficient implementation of a HSM event notification mechanism than
> > just implementing a couple of blocking upcalls. IF we want something
> > that will replace even simple DMAPI-based HSM use cases, we really
> > need to think through how to support all the operations that a
> > recall operation might needed for and hence have to block. ANd we
> > really should think about how to efficiently filter out unnecessary
> > events so that we don't drown the HSM in IO events it just doesn't
> > need to know about....
> 
> Thinking about efficient HSM implementation and testing prototypes is
> what I have been doing for the past 6 years in CTERA.

Great!

It sounds like you'll have no trouble documenting the design to
teach us all how you've solved these problems so we understand what
you are asking us to review... :)

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-09-23  7:57     ` Dave Chinner
@ 2022-09-23 11:22       ` Amir Goldstein
  0 siblings, 0 replies; 43+ messages in thread
From: Amir Goldstein @ 2022-09-23 11:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, linux-fsdevel, Plaster, Robert

On Fri, Sep 23, 2022 at 10:57 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Sep 22, 2022 at 07:35:39AM +0300, Amir Goldstein wrote:
> > On Thu, Sep 22, 2022 at 2:27 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Sun, Sep 11, 2022 at 09:12:06PM +0300, Amir Goldstein wrote:
> > > > Hi Jan,
> > > >
> > > > I wanted to consult with you about preliminary design thoughts
> > > > for implementing a hierarchical storage manager (HSM)
> > > > with fanotify.
> > > >
> > > > I have been in contact with some developers in the past
> > > > who were interested in using fanotify to implement HSM
> > > > (to replace old DMAPI implementation).
> > > >
> > > > Basically, FAN_OPEN_PERM + FAN_MARK_FILESYSTEM
> > > > should be enough to implement a basic HSM, but it is not
> > > > sufficient for implementing more advanced HSM features.
> > >
> > > Ah, I wondered where the people with that DMAPI application went all
> > > those years ago after I told them they should look into using
> > > fanotify to replace the dependency they had on the DMAPI patched
> > > XFS that SGI maintained for years for SLES kernels...
> > >
> >
> > Indeed. Robert has told that the fanotify HSM code would be uploaded
> > to github in the near future:
> >
> > https://deepspacestorage.com/resources/#downloads
> >
> > > > Some of the HSM feature that I would like are:
> > > > - blocking hook before access to file range and fill that range
> > > > - blocking hook before lookup of child and optionally create child
> > >
> > > Ok, so these are to replace the DMAPI hooks that provided a blocking
> > > userspace upcall to the HSM to allow it fetch data from offline
> > > teirs that wasn't currently in the filesystem itself. i.e. the inode
> > > currently has a hole over that range of data, but before the read
> > > can proceed the HSM needs to retreive the data from the remote
> > > storage and write it into the local filesystem.
> > >
> > > I think that you've missed a bunch of blocking notifications that
> > > are needed, though. e.g. truncate needs to block while the HSM
> > > records the file ranges it is storing offline are now freed.
> > > fallocate() needs to block while it waits for the HSM to tell it the
> > > ranges of the file that actually contain data and so should need to
> > > be taken into account. (e.g. ZERO_RANGE needs to wait for offline
> > > data to be invalidated, COLLAPSE_RANGE needs offline data to be
> > > recalled just like a read() operation, etc).
> > >
> > > IOWs, any operation that manipulates the extent map or the data in
> > > the file needs a blocking upcall to the HSM so that it can restore
> > > and invalidate the offline data across the range of the operation
> > > that is about to be performed....
> >
> > The event FAN_PRE_MODIFY mentioned below is destined to
> > take care of those cases.
> > It is destined to be called from
> > security_file_permission(MAY_WRITE) => fsnotify_perm()
> > which covers all the cases that you mentioned.
>
> That doesn't provide IO ranges to userspace, so does that mean
> write(1 byte) might trigger the recall of an entire TB sized file
> from offline storage?
>
> How do you support partial file migration with a scheme like this?

That would be handled by this item in the API extension proposal
quoted below:
- FAN_PRE_ACCESS and FAN_PRE_MODIFY can have
  optional file range info

>
> > > > My thoughts on the UAPI were:
> > > > - Allow new combination of FAN_CLASS_PRE_CONTENT
> > > >   and FAN_REPORT_FID/DFID_NAME
> > > > - This combination does not allow any of the existing events
> > > >   in mask
> > > > - It Allows only new events such as FAN_PRE_ACCESS
> > > >   FAN_PRE_MODIFY and FAN_PRE_LOOKUP
> > > > - FAN_PRE_ACCESS and FAN_PRE_MODIFY can have
> > > >   optional file range info
> > > > - All the FAN_PRE_ events are called outside vfs locks and
> > > >   specifically before sb_writers lock as in my fsnotify_pre_modify [1]
> > > >   POC
> > > >
> > > > That last part is important because the HSM daemon will
> > > > need to make modifications to the accessed file/directory
> > > > before allowing the operation to proceed.
> > >
> > > Yes, and that was the biggest problem with DMAPI - the locking
> > > involved. DMAPI operations have to block without holding any locks
> > > that the IO path, truncate, fallocate, etc might need, but once they
> > > are unblocked they need to regain those locks to allow the operation
> > > to proceed. This was by far the ugliest part of the DMAPI patches,
> > > and ultimately, the reason why it was never merged.
> >
> > Part of the problem is that the DMAPI hooks were inside fs code.
> > My intention for fanotify blocking hooks to be in vfs before taking any
> > locks as much as possible.
> >
> > I already have a poc project that added those pre-modify hooks
> > for the change tracking journal thing.
> >
> > For most of the syscalls, security_file_permission(MAY_WRITE) is
> > called before taking vfs locks (sb_writers in particular), except for
> > dedupe and clone and that one is my fault - 031a072a0b8a
> > ("vfs: call vfs_clone_file_range() under freeze protection"), so
> > I'll need to deal with it.
>
> This looks like a farly large TOCTOU race condition.  i.e. what's to
> prevent a "move offline" operation run by a userspace HSM agent
> racing with an application read that has done it's FAN_PRE_MODIFY
> call but hasn't yet started/completed it's physical read from the
> filesystem? This sort of race can result in the data in the file
> being migrated and punched out by the HSM before the read gets the
> i_rwsem or page locks it needs to read the data into cache....
>
> What am I missing here? How does this "no filesystem locks"
> notification scheme avoid these sorts of spurious user data
> corruption events from occurring?
>

See my answer to Jan on the same question.
The HSM will take an exclusive write lease on the file before punching
out its content. Leases are an existing API that I do not even need to
extend. I do need to POC this method though...
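Roughly, the eviction side could look like the sketch below.
F_SETLEASE, FALLOC_FL_PUNCH_HOLE and fsetxattr() are existing UAPIs;
the xattr name and the surrounding policy are just placeholders for
this example, and error handling is mostly omitted:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/xattr.h>
#include <linux/falloc.h>

/* Sketch: evict local file content safely.  Taking an exclusive write
 * lease fails if anyone else has the file open, and later opens block
 * until the lease is released or broken, so the punch-hole cannot race
 * with a reader that already passed its pre-content event.
 */
static int evict_content(const char *path, off_t size)
{
        int fd = open(path, O_RDWR);

        if (fd < 0)
                return -1;
        if (fcntl(fd, F_SETLEASE, F_WRLCK) < 0) {
                close(fd);              /* file is in use - try again later */
                return -1;
        }
        /* "user.hsm.offline" is a made-up marker for this example */
        fsetxattr(fd, "user.hsm.offline", "1", 1, 0);
        fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, size);
        fcntl(fd, F_SETLEASE, F_UNLCK);
        close(fd);
        return 0;
}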

> > > > Naturally that opens the possibility for new userspace
> > > > deadlocks. Nothing that is not already possible with permission
> > > > event, but maybe deadlocks that are more inviting to trip over.
> > > >
> > > > I am not sure if we need to do anything about this, but we
> > > > could make it easier to ignore events from the HSM daemon
> > > > itself if we want to, to make the userspace implementation easier.
> > >
> > > XFS used "invisible IO" as the mechanism for avoiding sending DMAPI
> > > events for operations that we initiated by the HSM to move data into
> > > and out of the filesystem.
> > >
> > > No doubt you've heard us talk about invisible IO in the past -
> > > O_NOCMTIME is what that invisible IO has eventually turned into in a
> > > modern Linux kernel. We still use that for invisible IO - xfs_fsr
> > > uses it for moving data around during online defragmentation. The
> > > entire purpose of invisible IO was to provide a path for HSMs and
> > > userspace bulk data movers (e.g. HSM aware backup tools like
> > > xfsdump) to do IO without generating unnecessary or recursive DMAPI
> > > events....
> > >
> > > IOWs, if we want a DMAPI replacement, we will need to formalise a
> > > method of performing syscall based operations that will not trigger
> > > HSM notification events.
> >
> > This concept already exists in fanotify.
> > This is what FMODE_NONOTIFY is for.
> > The event->fd handed in FAN_ACCESS_PERM event can be use to
> > read the file without triggering a recursive event.
> > This is how Anti-Malware products scan files.
>
> So all that you need is to enable invisible IO for the data
> migration operations, then?

Correct.

>
> BTW, where does the writable fd that is passed to the userspace
> event that the HSM can use to do IO to/from the inode come from? How
> does fanotify guarantee that the recipient has the necessary
> permissions to read/write to the file it represents?
>

fanotify opens the file and installs the fd in the event reader process.
That has always been the case.
It is one of the reasons that fanotify requires CAP_SYS_ADMIN for
subscribing to permission events, so IOW, only CAP_SYS_ADMIN can create
an fd for fanotify invisible IO and fanotify invisible lookup.
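To make this concrete, here is a minimal sketch of the recall path from
the daemon's side with the existing UAPI (the group must be initialized
with event_f_flags = O_RDWR for the event fd to be writable; fill_range()
and CHUNK_SIZE are placeholders for the actual download logic):

#include <sys/fanotify.h>
#include <unistd.h>

#define CHUNK_SIZE (1024 * 1024)        /* arbitrary for this example */

/* placeholder for the HSM logic that downloads data into the event fd */
extern void fill_range(int fd, off_t offset, size_t len);

static void handle_perm_event(int group_fd,
                              const struct fanotify_event_metadata *ev)
{
        struct fanotify_response resp = {
                .fd = ev->fd,
                .response = FAN_ALLOW,
        };

        if (ev->mask & FAN_ACCESS_PERM) {
                /* ev->fd was opened by fanotify with FMODE_NONOTIFY,
                 * so IO through it does not generate recursive events */
                fill_range(ev->fd, 0, CHUNK_SIZE);
        }
        /* allow the blocked operation to proceed */
        write(group_fd, &resp, sizeof(resp));
        close(ev->fd);
}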

> > I have extended that concept in my POC patch to avoid recursive
> > FAN_LOOKUP_PERM event when calling Xat() syscall with
> > a dirfd with FMODE_NONOTIFY:
> >
> > https://github.com/amir73il/linux/commits/fanotify_pre_content
> >
> > > The other thing that XFS had for DMAPI was persistent storage in the
> > > inode of the event mask that inode should report events for. See
> > > the di_dmevmask and di_dmstate fields defined in the on-disk inode
> > > format here:
> > >
> > > https://git.kernel.org/pub/scm/fs/xfs/xfs-documentation.git/tree/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc
> > >
> > > There's no detail for them, but the event mask indicated what DMAPI
> > > events the inode should issue notifications for, and the state field
> > > held information about DMAPI operations in progress.
> > >
> > > The event field is the important one here - if the event field was
> > > empty, access to the inode never generated DMAPI events. When the
> > > HSM moved data offline, the "READ" event mask bit was set by the HSM
> > > and that triggered DMAPI events for any operation that needed to
> > > read data or manipulate the extent map. When the data was brought
> > > entirely back online, the event masks count be cleared.
> > >
> >
> > HSM can and should manage this persistent bit, but it does not
> > have to be done using a special ioctl.
> > We already use xattr and/or fileattr flags in our HSM implementation.
> >
> > I've mentioned this in my reply to Jan.
> > The way I intend to address this "persistent READ hook" in
> > fanotify side is by attaching an HSM specific BFP program to
> > an fanotify filesystem mark.
>
> How is that fanotify mark and BPF hook set? What triggers it?
> What makes it persistent? Why does it need a BPF program? What is
> the program actually intended to do?
>

Currently, fanotify is able to place a (non-persistent) "mark" on an
inode, a mount or a filesystem to subscribe to permission events.

The BPF program is intended to be attached to a filesystem mark,
to filter in the kernel which events the user gets.

It is something we discussed and need in fanotify for other reasons
as well (e.g. subtree filtering).

When I am talking about a "persistent mark" on a file I do not mean
that after reboot, access to a file will generate a permission event.

What I mean is that after reboot, if the HSM is running and has placed
a filesystem mark with a BPF filter on the filesystem, then it will get
events on that file.

So "persistent mark" can be just about anything that HSM chooses
it to be, as long as a BFP program has a way to check that marker.
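None of this exists today, so purely as an illustration of the intended
division of labor (the attach point, the context structure and its
fields are all made up for this sketch; FS_NODUMP_FL is a real flag,
used here only as an arbitrary example of a marker the HSM could choose):

/* Hypothetical in-kernel filter attached to a filesystem mark.
 * Returns non-zero to deliver the event to the group, zero to drop it.
 */
SEC("fanotify/filter")                          /* made-up attach point */
int hsm_filter(struct fanotify_bpf_ctx *ctx)    /* made-up context type */
{
        /* deliver events only for objects carrying the HSM marker,
         * e.g. a fileattr flag the HSM repurposed for "offline" */
        return ctx->fileattr_flags & FS_NODUMP_FL;
}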

> You're making big assumptions that people actually know what you are
> trying to implement, also how and why. Perhaps you should write a
> design overview that connects all the dots and explains why things
> like custom BPF programs are necessary...

Yes. I most definitely should. That will be my next step.
And I will provide a demo HSM implementation with it.
I just wanted to send out a heads up to see if there are any
immediate reasons to shoot down the idea.

>
> > We could standartize a fileattr flag just the same as NODUMP
> > was for the use of backup applications (e.g. NODATA), but it
> > is not a prerequite, just a standard way for HSM to set the
> > persistent READ hook bit.
>
> So you do actually need the filesystem to store information about
> HSM events that need to be generated, but then all the events are
> generated at the VFS without holding filesysetm locks? That seems
> like it could be racy, too.
>

The filesystem is just a host for the flag/xattr.
Setting and clearing that marker is entirely the responsibility of
the HSM.

"no filesystem locks" is referring only to the period that the user is
handling the event. It does not mean that there are no locks
and other synchronization involved when checking the mark.

The HSM will only set or remove that mark while holding an exclusive
write lease, so I do not think checking the mark in a race-free manner
is going to be a problem. We'll see...

> > An HSM product could be configured to reappropriate NODUMP
> > for that matter or check for no READ bits in st_mode or xattr.
>
> You know better than to try to redefine the meaning of a well known
> attribute that has been exposed to userspace for 25+ years and is
> widely used...
>

I do. That means that I will not propose to reappropriate NODUMP
in kernel code, but an HSM implementation can do whatever it wants,
as long as we let it control the flag and check it with a BPF program.

To be fair, even a backup application cannot know that it is the only
backup application on the system, so it has always been the responsibility
of the sysadmin (or the distro) to make sure that backup applications
do not step on each other.

I will definitely push for defining a new flag that is dedicated
for HSM applications, but again, it would be the responsibility
of the sysadmin (or the distro) to make sure that HSM applications
do not step on each other.

> > > However, DMAPI also supports dual state operation, where the
> > > data in the local filesystem is also duplicated in the offline
> > > storage (e.g. immediately after a recall operation). This state can
> > > persist until data or layout is changed in the local filesystem,
> > > and so there's a "WRITE" event mask as well that allows the
> > > filesystem to inform the HSM that data it may have in offline
> > > storage is being changed.
> > >
> > > The state field is there to tell the HSM that an operation was in
> > > progress when the system crashed. As part of recovery, the HSM needs
> > > to find all the inodes that had DM operations in progress and either
> > > complete them or revert them to bring everything back to a
> > > consistent state. THe SGI HSMs used the bulkstat interfaces to scan
> > > the fs and find inodes that had a non-zero DM state field. This is
> > > one of the reasons that having bulkstat scale out to scanning
> > > millions of inodes a second ends up being important - coherency
> > > checking between the ondisk filesystem state and the userspace
> > > offline data tracking databases is a very important admin
> > > operation..
> > >
> >
> > Normally, HSM will be listening on a filesystem mark to async
> > FAN_MODIFY and FAN_CLOSE_WRITE events.
> >
> > To cover the case of crash and missing fanotify events, we use
> > the persistent change tracking journal.
> > My current prototype is in overlayfs driver using pre-modify
> > fsnotify hooks, as we discussed back in LSFMM 2018:
> >
> > https://github.com/amir73il/overlayfs/wiki/Overlayfs-watch
>
> Yup, which lead to the conclusion that the only way it could work
> was "synchronous write of the change intent to the change journal
> before the change is made".
>

Yes, but this is not a performance killer, because the performance
penalty is only paid on the first modification to a file in a directory.
The fast path (directory already marked dirty) does not need any
writing to the change tracking journal at all.

This is something that I am confident about, because the
overlayfs watch code has been in production for a long time
and we have lots of performance regression tests for large scale
setups (i.e. many concurrent users, many files and dirs).

> > The idea is to export those pre-modify hooks via fanotify
> > and move this implementation from the overlayfs driver to
> > userspace HSM daemon.
> >
> > Note that the name "Change Tracking Journal" may be confusing -
> > My implementation does not store a time sequence of events like
> > the NTFS Change Journal - it only stores a map of file handles of
> > directories containing new/changed/deleted files.
> >
> > Iterating this "Changed dirs map" is way faster then itereating
> > bulkstat of all inodes and looking for the WRITE bit.
>
> Maybe so, but you haven't ever provided data to indicate what sort
> of filesystem modification rate this change journal can sustain.
> Just from my understanding of what it needs to do (sync updates), I
> can't see how it could be implemented without affecting overall
> runtime file and directory modification rates adversely.
>

This may seem like an impossible trade-off.
The key to getting it right is making sure that most of the directory
modifications are not "first modifications", and that depends on
the frequency of the change tracking snapshots.

> At which point I ponder what is worse - always taking the runtime
> overhead for the change journal operations to be propagated to the
> HSM to record persistently and then replaying that journal after a
> mount, or the HSM just needing to run a bulkstat scan in the
> background after mount to do a coherency check and resolution scan?
>

My solution to that was "a little bit of both", and experience from the
field shows that there is a way to get to a good balance.
Even with weekly change tracking snapshots, the worst-case
"scan at mount" times become much smaller and the once-per-week
per-dir performance penalty is in the noise.

Essentially, the change tracking journal is most useful for users who
have massive amounts of cold and rarely accessed files.
Surely, a filesystem implementation of a "smart inode iterator" that
includes user-defined filters (e.g. ctime > T) would have been a very
efficient way to achieve the same thing, but I prefer a filesystem-agnostic
solution.

> > The responsibility of maintaining per file "dirty" state is on HSM
> > and it can be done using the change tracking journal and an
> > external database. Filesystem provided features such as ctime
> > and iversion can be used to optimize the management of "dirty"
> > state, but they are not a prerequisite and most of the time the
> > change journal is sufficient to be able to scale up, because it
> > can give you the answer to the question:
> > "In which of the multi million dirs, do I need to look for changes
> >  to be synced to secondary storage?".
>
> Sure, it's not every file that gets recorded, but it still sounds
> like this could be a significant percentage of the filesystem
> contents that need scanning. And the HSM would still have to scan
> them by the directory structure operations, which typically costs
> about 3-4x as much IO, CPU and wall time overhead per scanned inode
> compared to bulkstat.
>

You are very much correct.

There is another advantage to directory structure traversal.
It helps the HSM detect path changes, which some HSMs need.

If you take into account that you need to resolve the path back from
the directory inode, then bulkstat loses big time to directory traversal.

An important property of overlayfs watch that I did not get into, but
which is crucial to the HSM scanning stage: the journal records not only
the changed dir, but all of its ancestors as well, so the more important
question that the journal can answer is:
"Are there any changed files under this (huge) tree?".

Modifications to files with hardlinks or disconnected paths are trickier,
but it's workable, because those are the corner cases.

> Please note that I'm not saying that either mechanism is better, nor
> am I saying that bulkstat scanning is a perfect solution, just that
> I've got zero information about how this change journal performs and
> what impact it has on runtime operation. Bulkstat scans, OTOH, I
> understand intimately and especially in the context of HSM
> operations. i.e. I know that scanning a billion inodes only takes a
> few minutes of wall time with modern SSDs and CPUs...
>

I have also considered using bulkstat many times.
In the end it's a matter of balance. Some HSMs/data sets/workloads
may be better off with bulkstat, but then again, I prefer that at least
a filesystem-agnostic solution be available.

> > > The XFS dmapi event and state mask control APIs are now deprecated.
> > > The XFS_IOC_FSSETDM ioctl could read and write the values, and the
> > > the XFS V1 bulkstat ioctl could read them. There were also flags for
> > > things like extent mapping ioctls (FIEMAP equivalent) that ensured
> > > looking at the extent map didn't trigger DMAPI events and data
> > > recall.
> > >
> > > I guess what I'm trying to say is that there's a lot more to an
> > > efficient implementation of a HSM event notification mechanism than
> > > just implementing a couple of blocking upcalls. IF we want something
> > > that will replace even simple DMAPI-based HSM use cases, we really
> > > need to think through how to support all the operations that a
> > > recall operation might needed for and hence have to block. ANd we
> > > really should think about how to efficiently filter out unnecessary
> > > events so that we don't drown the HSM in IO events it just doesn't
> > > need to know about....
> >
> > Thinking about efficient HSM implementation and testing prototypes is
> > what I have been doing for the past 6 years in CTERA.
>
> Great!
>
> It sounds like you'll have no trouble documenting the design to
> teach us all how you've solved these problems so we understand what
> you are asking us to review... :)
>

I will try to do even better - I plan to provide reference design
prototype code, but first a wiki!

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-09-22 13:03               ` Amir Goldstein
@ 2022-09-26 15:27                 ` Jan Kara
  2022-09-28 12:29                   ` Amir Goldstein
  0 siblings, 1 reply; 43+ messages in thread
From: Jan Kara @ 2022-09-26 15:27 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Miklos Szeredi, Plaster, Robert, David Howells,
	linux-fsdevel, Fufu Fang

On Thu 22-09-22 16:03:41, Amir Goldstein wrote:
> On Thu, Sep 22, 2022 at 1:48 PM Jan Kara <jack@suse.cz> wrote:
> > On Tue 20-09-22 21:19:25, Amir Goldstein wrote:
> > > For the next steps of POC, I could do:
> > > - Report FAN_ACCESS_PERM range info to implement random read
> > >   patterns (e.g. unzip -l)
> > > - Introduce FAN_MODIFY_PERM, so file content could be downloaded
> > >   before modifying a read-write HSM cache
> > > - Demo conversion of a read-write FUSE HSM implementation
> > >   (e.g. https://github.com/volga629/davfs2)
> > > - Demo HSM with filesystem mark [*] and a hardcoded test filter
> > >
> > > [*] Note that unlike the case with recursive inotify, this POC HSM
> > > implementation is not racy, because of the lookup permission events.
> > > A filesystem mark is still needed to avoid pinning all the unpopulated
> > > cache tree leaf entries to inode cache, so that this HSM could work on
> > > a very large scale tree, the same as my original use case for implementing
> > > filesystem mark.
> >
> > Sounds good! Just with your concern about pinning - can't you use evictable
> > marks added on lookup for files / dirs you want to track? Maybe it isn't
> > great design for other reasons but it would save you some event
> > filtering...
> >
> 
> With the current POC, there is no trigger to re-establish the evicted mark,
> because the parent is already populated and has no mark.

So my original thinking was that you'd place FAN_LOOKUP_PERM mark on top of
the directory tree and then you'd add evictable marks to all the subdirs
that are looked up from the FAN_LOOKUP_PERM event handler. That way I'd
imagine you can place evictable marks on all directories that are used in a
race-free manner.
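Roughly what I have in mind, sketched from the event handler
(FAN_LOOKUP_PERM is of course still your proposed event;
FAN_MARK_EVICTABLE exists since 5.19; this assumes the event carries an
open fd for the directory being looked up):

/* Sketch: from the FAN_LOOKUP_PERM handler, place an evictable inode
 * mark on the directory being looked up.  The kernel may reclaim the
 * mark together with the inode, so nothing stays pinned.
 */
if (ev->mask & FAN_LOOKUP_PERM)                 /* proposed event */
        fanotify_mark(group_fd, FAN_MARK_ADD | FAN_MARK_EVICTABLE,
                      FAN_ACCESS_PERM | FAN_ONDIR, ev->fd, NULL);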

> A hook on instantiate of inode in inode cache could fill that gap.
> It could still be useful to filter FAN_INSTANTIATE_PERM events in the
> kernel but it is not a must because instantiate is more rare than (say) lookup
> and then the fast lookup path (RCU walk) on populated trees suffers almost
> no overhead when the filesystem is watched.
> 
> Please think about this and let me know if you think that this is a direction
> worth pursuing, now, or as a later optimization.

I think an event on instantiate seems to be depending too much on kernel
internals instead of obvious filesystem operations. Also it might be a bit
challenging during startup when you don't know what is cached and what not
so you cannot rely on instantiate events for placing marks. So I'd leave
this for future optimization.

> > > If what you are looking for is an explanation why fanotify HSM would be better
> > > than a FUSE HSM implementation then there are several reasons.
> > > Performance is at the top of the list. There is this famous USENIX paper [3]
> > > about FUSE passthrough performance.
> > > It is a bit outdated, but many parts are still relevant - you can ask
> > > the Android
> > > developers why they decided to work on FUSE-BFP...
> > >
> > > [3] https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf
> > >
> > > For me, performance is one of the main concerns, but not the only one,
> > > so I am not entirely convinced that a full FUSE-BFP implementation would
> > > solve all my problems.
> > >
> > > When scaling to many millions of passthrough inodes, resource usage start
> > > becoming a limitation of a FUSE passthrough implementation and memory
> > > reclaim of native fs works a lot better than memory reclaim over FUSE over
> > > another native fs.
> > >
> > > When the workload works on the native filesystem, it is also possible to
> > > use native fs features (e.g. XFS ioctls).
> >
> > OK, understood. Out of curiosity you've mentioned you'd looked into
> > implementing HSM in overlayfs. What are the issues there? I assume
> > performance is very close to native one so that is likely not an issue and
> > resource usage you mention above likely is not that bad either. So I guess
> > it is that you don't want to invent hooks for userspace for moving (parts
> > of) files between offline storage and the local cache?
> 
> In a nutshell, when realizing that overlayfs needs userspace hooks
> to cater HSM, it becomes quite useless to use a stacked fs design.
> 
> Performance is not a problem with overlayfs, but like with FUSE,
> all the inodes/dentries in the system double, memory reclaim
> of layered fs becomes an awkward dance, which messes with the
> special logic of xfs shrinkers, and on top of all this, overlayfs does
> not proxy all the XFS ioctls either.
> 
> The fsnotify hooks are a much better design when realizing that
> the likely() case is to do nothing and incur least overhead and
> the unlikely() case of user hook is rare.

OK, understood. Thanks!

> > The remaining concern I have is that we should demonstrate the solution is
> > able to scale to millions of inodes (and likely more) because AFAIU that
> > are the sizes current HSM solutions are interested in. I guess this is kind
> > of covered in your last step of POCs though.
> >
> 
> Well, in $WORK we have performance test setups for those workloads,
> so part of my plan is to convert the in-house FUSE HSM
> to fanotify and make sure that all those tests do not regress.
> But that is not code, nor tests that I can release, I can only report back
> that the POC works and show the building blocks that I used on
> some open source code base.

Even this is useful I think.

> I plan to do the open source small scale POC first to show the
> building blocks so you could imagine the end results and
> then take the building blocks for a test drive in the real world.
> 
> I've put my eye on davfs2 [1] as the code base for read-write HSM
> POC, but maybe I will find an S3 FUSE fs that could work too
> I am open to other suggestions.
> 
> [1] https://github.com/volga629/davfs2
> 
> When DeepSpace Storage releases their product to github,
> I will be happy to work with them on a POC with their code
> base and I bet they could arrange a large scale test setup.
> (hint hint).

:-)

							Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-09-26 15:27                 ` Jan Kara
@ 2022-09-28 12:29                   ` Amir Goldstein
  2022-09-29 10:01                     ` Jan Kara
  0 siblings, 1 reply; 43+ messages in thread
From: Amir Goldstein @ 2022-09-28 12:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: Miklos Szeredi, Plaster, Robert, David Howells, linux-fsdevel,
	Fufu Fang, Dave Chinner

On Mon, Sep 26, 2022 at 6:27 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 22-09-22 16:03:41, Amir Goldstein wrote:
> > On Thu, Sep 22, 2022 at 1:48 PM Jan Kara <jack@suse.cz> wrote:
> > > On Tue 20-09-22 21:19:25, Amir Goldstein wrote:
> > > > For the next steps of POC, I could do:
> > > > - Report FAN_ACCESS_PERM range info to implement random read
> > > >   patterns (e.g. unzip -l)
> > > > - Introduce FAN_MODIFY_PERM, so file content could be downloaded
> > > >   before modifying a read-write HSM cache
> > > > - Demo conversion of a read-write FUSE HSM implementation
> > > >   (e.g. https://github.com/volga629/davfs2)
> > > > - Demo HSM with filesystem mark [*] and a hardcoded test filter
> > > >
> > > > [*] Note that unlike the case with recursive inotify, this POC HSM
> > > > implementation is not racy, because of the lookup permission events.
> > > > A filesystem mark is still needed to avoid pinning all the unpopulated
> > > > cache tree leaf entries to inode cache, so that this HSM could work on
> > > > a very large scale tree, the same as my original use case for implementing
> > > > filesystem mark.
> > >
> > > Sounds good! Just with your concern about pinning - can't you use evictable
> > > marks added on lookup for files / dirs you want to track? Maybe it isn't
> > > great design for other reasons but it would save you some event
> > > filtering...
> > >
> >
> > With the current POC, there is no trigger to re-establish the evicted mark,
> > because the parent is already populated and has no mark.
>
> So my original thinking was that you'd place FAN_LOOKUP_PERM mark on top of
> the directory tree and then you'd add evictable marks to all the subdirs
> that are looked up from the FAN_LOOKUP_PERM event handler. That way I'd
> imagine you can place evictable marks on all directories that are used in a
> race-free manner.
>

Maybe I am missing something.
I don't see how that can scale up to provide penalty-free fast-path
lookup for fully populated subtrees.

> > A hook on instantiate of inode in inode cache could fill that gap.
> > It could still be useful to filter FAN_INSTANTIATE_PERM events in the
> > kernel but it is not a must because instantiate is more rare than (say) lookup
> > and then the fast lookup path (RCU walk) on populated trees suffers almost
> > no overhead when the filesystem is watched.
> >
> > Please think about this and let me know if you think that this is a direction
> > worth pursuing, now, or as a later optimization.
>
> I think an event on instantiate seems to be depending too much on kernel
> internals instead of obvious filesystem operations. Also it might be a bit
> challenging during startup when you don't know what is cached and what not
> so you cannot rely on instantiate events for placing marks. So I'd leave
> this for future optimization.
>

Perhaps a user FAN_INSTANTIATE_PERM is too much, but I was
trying to figure out a way to make automatic inode marks work.
If we can define reasonable use cases for automatic inode marks that
the kernel can implement (e.g. inheriting from the parent on dentry
instantiate), then this could possibly get us something useful.
Maybe that is what you meant by the suggestion above?

The other use case of automatic inode marks I was thinking about,
which are even more relevant for $SUBJECT is this:
When instantiating a dentry with an inode that has xattr
"security.fanotify.mask" (a.k.a. persistent inode mark), an inode
mark could be auto created and attached to a group with a special sb
mark (we can limit a single special mark per sb).
This could be implemented similar to get_acl(), where i_fsnotify_mask
is always initialized with a special value (i.e. FS_UNINITIALIZED)
which is set to either 0 or non-zero if "security.fanotify.mask" exists.

The details of how such an API would look like are very unclear to me,
so I will try to focus on the recursive auto inode mark idea.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-09-28 12:29                   ` Amir Goldstein
@ 2022-09-29 10:01                     ` Jan Kara
  2022-10-07 13:58                       ` Amir Goldstein
  0 siblings, 1 reply; 43+ messages in thread
From: Jan Kara @ 2022-09-29 10:01 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Miklos Szeredi, Plaster, Robert, David Howells,
	linux-fsdevel, Fufu Fang, Dave Chinner

On Wed 28-09-22 15:29:13, Amir Goldstein wrote:
> On Mon, Sep 26, 2022 at 6:27 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Thu 22-09-22 16:03:41, Amir Goldstein wrote:
> > > On Thu, Sep 22, 2022 at 1:48 PM Jan Kara <jack@suse.cz> wrote:
> > > > On Tue 20-09-22 21:19:25, Amir Goldstein wrote:
> > > > > For the next steps of POC, I could do:
> > > > > - Report FAN_ACCESS_PERM range info to implement random read
> > > > >   patterns (e.g. unzip -l)
> > > > > - Introduce FAN_MODIFY_PERM, so file content could be downloaded
> > > > >   before modifying a read-write HSM cache
> > > > > - Demo conversion of a read-write FUSE HSM implementation
> > > > >   (e.g. https://github.com/volga629/davfs2)
> > > > > - Demo HSM with filesystem mark [*] and a hardcoded test filter
> > > > >
> > > > > [*] Note that unlike the case with recursive inotify, this POC HSM
> > > > > implementation is not racy, because of the lookup permission events.
> > > > > A filesystem mark is still needed to avoid pinning all the unpopulated
> > > > > cache tree leaf entries to inode cache, so that this HSM could work on
> > > > > a very large scale tree, the same as my original use case for implementing
> > > > > filesystem mark.
> > > >
> > > > Sounds good! Just with your concern about pinning - can't you use evictable
> > > > marks added on lookup for files / dirs you want to track? Maybe it isn't
> > > > great design for other reasons but it would save you some event
> > > > filtering...
> > > >
> > >
> > > With the current POC, there is no trigger to re-establish the evicted mark,
> > > because the parent is already populated and has no mark.
> >
> > So my original thinking was that you'd place FAN_LOOKUP_PERM mark on top of
> > the directory tree and then you'd add evictable marks to all the subdirs
> > that are looked up from the FAN_LOOKUP_PERM event handler. That way I'd
> > imagine you can place evictable marks on all directories that are used in a
> > race-free manner.
> >
> 
> Maybe I am missing something.
> I don't see how that can scale up to provide penalty free fast path lookup
> for fully populated subtrees.

No, you are right that this scheme would have non-trivial overhead in
processing the lookup events even when the tree is fully populated.

> > > A hook on instantiate of inode in inode cache could fill that gap.
> > > It could still be useful to filter FAN_INSTANTIATE_PERM events in the
> > > kernel but it is not a must because instantiate is more rare than (say) lookup
> > > and then the fast lookup path (RCU walk) on populated trees suffers almost
> > > no overhead when the filesystem is watched.
> > >
> > > Please think about this and let me know if you think that this is a direction
> > > worth pursuing, now, or as a later optimization.
> >
> > I think an event on instantiate seems to be depending too much on kernel
> > internals instead of obvious filesystem operations. Also it might be a bit
> > challenging during startup when you don't know what is cached and what not
> > so you cannot rely on instantiate events for placing marks. So I'd leave
> > this for future optimization.
> >
> 
> Perhaps a user FAN_INSTANTIATE_PERM is too much, but I was
> trying to figure out a way to make automatic inode marks work.
> If we can define reasonable use cases for automatic inode marks that
> kernel can implement (e.g. inheriting from parent on dentry instantiate)
> then this could possibly get us something useful.
> Maybe that is what you meant with the suggestion above?

Well, my suggestion was pondering whether we could implement something
like automatic inode marks in userspace using the FAN_LOOKUP_PERM event.
But you are right, the overhead in the fast path does not make this very
attractive. So we'll have to look more into the in-kernel solution.

> The other use case of automatic inode marks I was thinking about,
> which are even more relevant for $SUBJECT is this:
> When instantiating a dentry with an inode that has xattr
> "security.fanotify.mask" (a.k.a. persistent inode mark), an inode
> mark could be auto created and attached to a group with a special sb
> mark (we can limit a single special mark per sb).
> This could be implemented similar to get_acl(), where i_fsnotify_mask
> is always initialized with a special value (i.e. FS_UNINITIALIZED)
> which is set to either 0 or non-zero if "security.fanotify.mask" exists.
> 
> The details of how such an API would look like are very unclear to me,
> so I will try to focus on the recursive auto inode mark idea.

Yeah, although initializing fanotify marks based on xattrs does not look
completely crazy I can see a lot of open questions there so I think
automatic inode mark idea has more chances for success at this point :).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-09-29 10:01                     ` Jan Kara
@ 2022-10-07 13:58                       ` Amir Goldstein
  2022-10-12 15:44                         ` Jan Kara
  0 siblings, 1 reply; 43+ messages in thread
From: Amir Goldstein @ 2022-10-07 13:58 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner

[reducing CC]

> > The other use case of automatic inode marks I was thinking about,
> > which are even more relevant for $SUBJECT is this:
> > When instantiating a dentry with an inode that has xattr
> > "security.fanotify.mask" (a.k.a. persistent inode mark), an inode
> > mark could be auto created and attached to a group with a special sb
> > mark (we can limit a single special mark per sb).
> > This could be implemented similar to get_acl(), where i_fsnotify_mask
> > is always initialized with a special value (i.e. FS_UNINITIALIZED)
> > which is set to either 0 or non-zero if "security.fanotify.mask" exists.
> >
> > The details of how such an API would look like are very unclear to me,
> > so I will try to focus on the recursive auto inode mark idea.
>
> Yeah, although initializing fanotify marks based on xattrs does not look
> completely crazy I can see a lot of open questions there so I think
> automatic inode mark idea has more chances for success at this point :).

I realized that there is one sort of "persistent mark" that raises
fewer questions - one that only has an ignore mask.

ignore masks can have a "static" namespace that is not bound to any
specific group, but rather to a set of groups that join this namespace.

I played with this idea and wrote some patches:
https://github.com/amir73il/linux/commits/fan_xattr_ignore_mask

This may end up being useful in the HSM POC - i.e., the HSM places a
persistent ignore mask for permission events on populated dirs/files
and removes that persistent mask before punching a hole.

Haven't forgotten about the promised wiki.
For now, I just wanted to share this idea.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-10-07 13:58                       ` Amir Goldstein
@ 2022-10-12 15:44                         ` Jan Kara
  2022-10-12 16:28                           ` Amir Goldstein
  0 siblings, 1 reply; 43+ messages in thread
From: Jan Kara @ 2022-10-12 15:44 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Dave Chinner

Hi Amir!

On Fri 07-10-22 16:58:21, Amir Goldstein wrote:
> > > The other use case of automatic inode marks I was thinking about,
> > > which are even more relevant for $SUBJECT is this:
> > > When instantiating a dentry with an inode that has xattr
> > > "security.fanotify.mask" (a.k.a. persistent inode mark), an inode
> > > mark could be auto created and attached to a group with a special sb
> > > mark (we can limit a single special mark per sb).
> > > This could be implemented similar to get_acl(), where i_fsnotify_mask
> > > is always initialized with a special value (i.e. FS_UNINITIALIZED)
> > > which is set to either 0 or non-zero if "security.fanotify.mask" exists.
> > >
> > > The details of how such an API would look like are very unclear to me,
> > > so I will try to focus on the recursive auto inode mark idea.
> >
> > Yeah, although initializing fanotify marks based on xattrs does not look
> > completely crazy I can see a lot of open questions there so I think
> > automatic inode mark idea has more chances for success at this point :).
> 
> I realized that there is one sort of "persistent mark" who raises
> less questions - one that only has an ignore mask.
> 
> ignore masks can have a "static" namespace that is not bound to any
> specific group, but rather a set of groups that join this namespace.
> 
> I played with this idea and wrote some patches:
> https://github.com/amir73il/linux/commits/fan_xattr_ignore_mask

I have glanced over the patches. In general the idea looks OK to me but I
have some concerns:

1) Technically, it may be challenging to call into filesystem xattr
handling code on first event generated by the inode - that may generate
some unexpected lock recursion for some filesystems and some events which
trigger the initialization...

2) What if you set the xattr while the group is already listening to
events? Currently the change will get ignored, won't it? But I guess this
could be handled by clearing the "cached" flag when the xattr is set.

3) What if multiple applications want to use the persistent mark
functionality? I think we need some way to associate a particular
fanotify group with a particular subset of fanotify xattrs so that
coexistence of multiple applications is possible...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-10-12 15:44                         ` Jan Kara
@ 2022-10-12 16:28                           ` Amir Goldstein
  2022-10-13 12:16                             ` Amir Goldstein
  0 siblings, 1 reply; 43+ messages in thread
From: Amir Goldstein @ 2022-10-12 16:28 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner

On Wed, Oct 12, 2022 at 6:44 PM Jan Kara <jack@suse.cz> wrote:
>
> Hi Amir!
>
> On Fri 07-10-22 16:58:21, Amir Goldstein wrote:
> > > > The other use case of automatic inode marks I was thinking about,
> > > > which are even more relevant for $SUBJECT is this:
> > > > When instantiating a dentry with an inode that has xattr
> > > > "security.fanotify.mask" (a.k.a. persistent inode mark), an inode
> > > > mark could be auto created and attached to a group with a special sb
> > > > mark (we can limit a single special mark per sb).
> > > > This could be implemented similar to get_acl(), where i_fsnotify_mask
> > > > is always initialized with a special value (i.e. FS_UNINITIALIZED)
> > > > which is set to either 0 or non-zero if "security.fanotify.mask" exists.
> > > >
> > > > The details of how such an API would look like are very unclear to me,
> > > > so I will try to focus on the recursive auto inode mark idea.
> > >
> > > Yeah, although initializing fanotify marks based on xattrs does not look
> > > completely crazy I can see a lot of open questions there so I think
> > > automatic inode mark idea has more chances for success at this point :).
> >
> > I realized that there is one sort of "persistent mark" who raises
> > less questions - one that only has an ignore mask.
> >
> > ignore masks can have a "static" namespace that is not bound to any
> > specific group, but rather a set of groups that join this namespace.
> >
> > I played with this idea and wrote some patches:
> > https://github.com/amir73il/linux/commits/fan_xattr_ignore_mask
>
> I have glanced over the patches. In general the idea looks OK to me but I
> have some concerns:
>
> 1) Technically, it may be challenging to call into filesystem xattr
> handling code on first event generated by the inode - that may generate
> some unexpected lock recursion for some filesystems and some events which
> trigger the initialization...

That may be a correct statement in general, but please note that
- Only permission events trigger auto-init of xattr ignore mask
- Permission events are called from security hooks
- Security hooks may also call getxattr to get the security context

Perhaps LSMs always initialize the security context in OPEN and
never in ACCESS?

One of the earlier versions of the patch initialized the xattr ignore mask
on the LOOKUP permission event if ANY object was interested in ANY
permission event, even if no object was interested in LOOKUP, in order
to mimic the LSM context initialization, but it was complicated and I
wasn't sure if it was necessary.

Maybe that is the way to go...

>
> 2) What if you set the xattr while the group is already listening to
> events? Currently the change will get ignored, won't it? But I guess this
> could be handled by clearing the "cached" flag when the xattr is set.
>

I have created an API to update the xattr via
  fanotify_mark(FAN_MARK_XATTR, ...
which updates the cached ignore mask in the connector.

I see no reason to support "direct" modifications of this xattr.
If such changes are made directly, it is fine to ignore them.
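So from the HSM side the intended usage would be roughly the following
(FAN_MARK_XATTR is the proposed flag from the POC branch, not an upstream
API; the other flags and event bits are existing UAPI):

/* File is fully populated - persist an ignore mask so that permission
 * events for it are suppressed, even across inode eviction / reboot. */
fanotify_mark(group_fd,
              FAN_MARK_ADD | FAN_MARK_XATTR | FAN_MARK_IGNORED_MASK |
              FAN_MARK_IGNORED_SURV_MODIFY,
              FAN_OPEN_PERM | FAN_ACCESS_PERM, AT_FDCWD, path);

/* Before punching out the content - drop the persistent ignore mask so
 * that the next access generates a permission event again. */
fanotify_mark(group_fd,
              FAN_MARK_REMOVE | FAN_MARK_XATTR | FAN_MARK_IGNORED_MASK,
              FAN_OPEN_PERM | FAN_ACCESS_PERM, AT_FDCWD, path);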

> 3) What if multiple applications want to use the persistent mark
> functionality? I think we need some way to associate a particular
> fanotify group with a particular subset of fanotify xattrs so that
> coexistence of multiple applications is possible...
>

Yeah, I thought about this as well.
The API in the patches is quite naive because it implements a single
global namespace for xattr ignore masks, but mostly I wanted to
see if I could get the fast path and auto-init implementation done.

I was generally thinking of ioctl() as the API to join an xattr marks
namespace and negotiate the on-disk format of persistent marks
supported by the application.

I would not want to allow multiple fanotify xattrs per inode -
that could have the consequence of the inode becoming a junkyard.

I'd prefer to have a single xattr (say security.fanotify.mark)
and that mark will have (see the strawman layout sketched after this list):
- on-disk format version
- namespace id
- ignore mask
- etc
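
Something like this strawman layout, just to illustrate (names and fields
are not part of any API or on-disk format):

/* Strawman layout of a single "security.fanotify.mark" xattr payload. */
struct fanotify_xattr_mark {
        __u8    version;        /* on-disk format version */
        __u8    pad[3];
        __u32   nsid;           /* namespace id of the owning application */
        __u64   ignore_mask;    /* events to ignore for this inode */
        __u64   mask;           /* reserved for a future "positive" event mask */
};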

If multiple applications want to use persistent marks they need to figure
out how to work together without stepping on each other's toes.
I don't think it is up to fanotify to coordinate that.

fanotify_mark() can fail with EEXIST when a group that joined namespace A
is trying to update a persistent mark while a persistent mark of namespace B
already exists, and some FAN_MARK_REPLACE flag could probably be used
to force-overwrite the existing persistent mark.

Do you agree?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-10-12 16:28                           ` Amir Goldstein
@ 2022-10-13 12:16                             ` Amir Goldstein
  2022-11-03 12:57                               ` Jan Kara
  0 siblings, 1 reply; 43+ messages in thread
From: Amir Goldstein @ 2022-10-13 12:16 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner

On Wed, Oct 12, 2022 at 7:28 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Wed, Oct 12, 2022 at 6:44 PM Jan Kara <jack@suse.cz> wrote:
> >
> > Hi Amir!
> >
> > On Fri 07-10-22 16:58:21, Amir Goldstein wrote:
> > > > > The other use case of automatic inode marks I was thinking about,
> > > > > which are even more relevant for $SUBJECT is this:
> > > > > When instantiating a dentry with an inode that has xattr
> > > > > "security.fanotify.mask" (a.k.a. persistent inode mark), an inode
> > > > > mark could be auto created and attached to a group with a special sb
> > > > > mark (we can limit a single special mark per sb).
> > > > > This could be implemented similar to get_acl(), where i_fsnotify_mask
> > > > > is always initialized with a special value (i.e. FS_UNINITIALIZED)
> > > > > which is set to either 0 or non-zero if "security.fanotify.mask" exists.
> > > > >
> > > > > The details of how such an API would look like are very unclear to me,
> > > > > so I will try to focus on the recursive auto inode mark idea.
> > > >
> > > > Yeah, although initializing fanotify marks based on xattrs does not look
> > > > completely crazy I can see a lot of open questions there so I think
> > > > automatic inode mark idea has more chances for success at this point :).
> > >
> > > I realized that there is one sort of "persistent mark" who raises
> > > less questions - one that only has an ignore mask.
> > >
> > > ignore masks can have a "static" namespace that is not bound to any
> > > specific group, but rather a set of groups that join this namespace.
> > >
> > > I played with this idea and wrote some patches:
> > > https://github.com/amir73il/linux/commits/fan_xattr_ignore_mask
> >
> > I have glanced over the patches. In general the idea looks OK to me but I
> > have some concerns:
> >
> > 1) Technically, it may be challenging to call into filesystem xattr
> > handling code on first event generated by the inode - that may generate
> > some unexpected lock recursion for some filesystems and some events which
> > trigger the initialization...
>
> That may be a correct statement in general, but please note that
> - Only permission events trigger auto-init of xattr ignore mask
> - Permission events are called from security hooks
> - Security hooks may also call getxattr to get the security context
>
> Perhaps LSMs always initialize the security context in OPEN and
> never in ACCESS?
>
> One of the earlier versions of the patch initialized xattr ignore mask
> on LOOKUP permission event, if ANY object was interested in ANY
> permission event even if no object was interested in LOOKUP
> to mimic the LSM context initialization,
> but it was complicated and I wasn't sure if this was necessary.
>

Also, permission events can sleep by definition,
so why would getxattr not be safe in the
context of permission event handling?

>
> >
> > 2) What if you set the xattr while the group is already listening to
> > events? Currently the change will get ignored, won't it? But I guess this
> > could be handled by clearing the "cached" flag when the xattr is set.
> >
>
> I have created an API to update the xattr via
>   fanotify_mark(FAN_MARK_XATTR, ...
> which updates the cached ignore mask in the connector.
>
> I see no reason to support "direct" modifications of this xattr.
> If such changes are made directly it is fine to ignore them.
>
> > 3) What if multiple applications want to use the persistent mark
> > functionality? I think we need some way to associate a particular
> > fanotify group with a particular subset of fanotify xattrs so that
> > coexistence of multiple applications is possible...
> >
>
> Yeh, I thought about this as well.
> The API in the patches is quite naive because it implements a single
> global namespace for xattr ignore masks, but mostly I wanted to
> see if I can get the fast path and auto-init implementation done.
>
> I was generally thinking of ioctl() as the API to join an xattr marks
> namespace and negotiate the on-disk format of persistent marks
> supported by the application.
>
> I would not want to allow multiple fanotify xattrs per inode -
> that could have the consequence of the inode becoming a junkyard.
>
> I'd prefer to have a single xattr (say security.fanotify.mark)
> and that mark will have
> - on-disk format version
> - namespace id
> - ignore mask
> - etc
>
> If multiple applications want to use persistent marks they need to figure
> out how to work together without stepping on each other's toes.
> I don't think it is up to fanotify to coordinate that.
>
> fanotify_mark() can fail with EEXIST when a group that joined namespace A
> is trying to update a persistent mark when a persistent mark of namespace B
> already exists and probably some FAN_MARK_REPLACE flag could be used
> to force overwrite the existing persistent mark.
>

One thing that I feel a bit silly about is something that
I only fully noticed after writing this WIP wiki entry:
https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#persistent-inode-marks

Persistent marks (in xattr) with an ignore mask are useful, but only a
little bit more useful than evictable marks with an ignore mask.

Persistent marks (in xattr) with a "positive" event mask are the real deal,
because with "positive" persistent marks we will be able to optimize away
srcu_read_lock() and mark iteration for the majority of fsnotify hooks
by looking at the objects' interest masks (including the FS_XATTR_CACHED bit).

The good thing about the POC patches [1] is that there is no technical
limitation in this code against using persistent marks with positive
event masks.
It is a bit more challenging to document the fact that a "normal" fs/mount
mark is needed in order to "activate" the persistent marks in the inodes
of this fs/mount, but the changes to the code to support that would be
minimal.

Thanks,
Amir.

[1] https://github.com/amir73il/linux/commits/fan_xattr_ignore_mask

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-09-22 10:48             ` Jan Kara
  2022-09-22 13:03               ` Amir Goldstein
@ 2022-10-28 12:50               ` Amir Goldstein
  2022-11-03 16:30                 ` Jan Kara
  1 sibling, 1 reply; 43+ messages in thread
From: Amir Goldstein @ 2022-10-28 12:50 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner

On Thu, Sep 22, 2022 at 1:48 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 20-09-22 21:19:25, Amir Goldstein wrote:
[...]
> > Hi Jan,
> >
> > I wanted to give an update on the POC that I am working on.
> > I decided to find a FUSE HSM and show how it may be converted
> > to use fanotify HSM hooks.
> >
> > HTTPDirFS is a read-only FUSE filesystem that lazyly populates a local
> > cache from a remote http on first access to every directory and file range.
> >
> > Normally, it would be run like this:
> > ./httpdirfs --cache-location /vdf/cache https://cdn.kernel.org/pub/ /mnt/pub/
> >
> > Content is accessed via FUSE mount as /mnt/pub/ and FUSE implements
> > passthrough calls to the local cache dir if cache is already populated.
> >
> > After my conversion patches [1], this download-only HSM can be run like
> > this without mounting FUSE:
> >
> > sudo ./httpdirfs --fanotify --cache-location /vdf/cache
> > https://cdn.kernel.org/pub/ -
> >
> > [1] https://github.com/amir73il/httpdirfs/commits/fanotify_pre_content
> >
> > Browsing the cache directory at /vdf/cache, lazyly populates the local cache
> > using FAN_ACCESS_PERM readdir hooks and lazyly downloads files content
> > using FAN_ACCESS_PERM read hooks.
> >
> > Up to this point, the implementation did not require any kernel changes.
> > However, this type of command does not populate the path components,
> > because lookup does not generate FAN_ACCESS_PERM event:
> >
> > stat /vdf/cache/data/linux/kernel/firmware/linux-firmware-20220815.tar.gz
> >
> > To bridge that functionality gap, I've implemented the FAN_LOOKUP_PERM
> > event [2] and used it to lazyly populate directories in the path ancestry.
> > For now, I stuck with the XXX_PERM convention and did not require
> > FAN_CLASS_PRE_CONTENT, although we probably should.
> >
> > [2] https://github.com/amir73il/linux/commits/fanotify_pre_content
> >
> > Streaming read of large files works as well, but only for sequential read
> > patterns. Unlike the FUSE read calls, the FAN_ACCESS_PERM events
> > do not (yet) carry range info, so my naive implementation downloads
> > one extra data chunk on each FAN_ACCESS_PERM until the cache file is full.
> >
> > This makes it possible to run commands like:
> >
> > tar tvfz /vdf/cache/data/linux/kernel/firmware/linux-firmware-20220815.tar.gz
> > | less
> >
> > without having to wait for the entire 400MB file to download before
> > seeing the first page.
> >
> > This streaming feature is extremely important for modern HSMs
> > that are often used to archive large media files in the cloud.
>
> Thanks for update Amir! I've glanced through the series and so far it looks
> pretty simple and I'd have only some style / readability nits (but let's
> resolve those once we have something more complete).
>
> When thinking about HSM (and while following your discussion with Dave) I
> wondered about one thing: When the notifications happen before we take
> locks, then we are in principle prone to time-to-check-time-to-use races,
> aren't we? How are these resolved?
>
> For example something like:
> We have file with size 16k.
> Reader:                         Writer
>   read 8k at offset 12k
>     -> notification sent
>     - HSM makes sure 12-16k is here and 16-20k is beyond eof so nothing to do
>
>                                 expand file to 20k
>   - now the file contents must not get moved out until the reader is
>     done in order not to break it
>

Hi Jan,

It's been a while since I updated this topic.
I have been making progress on the wiki and POC, but it's not done yet.

I would like to pick your brain about my proposed solutions for the
TOCTOU race issues, because the solution is subtle and you may have
better ideas to suggest.

>
> > Questions:
> > - What do you think about the direction this POC has taken so far?
> > - Is there anything specific that you would like to see in the POC
> >   to be convinced that this API will be useful?
>
> I think your POC is taking a good direction and your discussion with Dave
> had made me more confident that this is all workable :). I liked your idea
> of the wiki (or whatever form of documentation) that summarizes what we've
> discussed in this thread. That would be actually pretty nice for future
> reference.
>

The current state of the POC is that "populate on access" of both files
and directories is working and "race-free evict of file content" is also
implemented (safely AFAIK).

The technique involving an exclusive write lease is discussed at [1].
In a nutshell, populate and evict synchronize on the atomic i_writecount,
and this technique can be implemented with upstream UAPIs.

I did use persistent xattr marks for the POC, but this is not a must.
Evictable inode marks would have worked just as well.

Now I have started to work on persistent change tracking.
For this, I have only kernel code, only lightly tested, and I have not
yet proven that the technique works.

The idea that I started to sketch at [2] is to alternate between two groups.

When a change is recorded, an evictable ignore mark will be added to the
object. To start recording changes from a new point in time (checkpoint),
a new group will be created (with no ignore marks) and the old group will
be closed.

The core of the algorithm is the "safe handover" between groups.
This requires two infrastructure additions.

The first is FAN_MARK_SYNC [3], as described in the commit message:
---
    Synchronous add of mark or remove/flush of marks with ignore mask
    provides a method for safe handover of event handling between two groups:

    - First, group A subscribes to some events with FAN_MARK_SYNC
    - Then, group B unsubscribes from those events

    This method guarantees that any event that both groups subscribed
    to, will be delivered to either group or to both of them.

    Note that FAN_MARK_SYNC provides no synchronization to the object
    interest masks, which are checked outside srcu read side.
    Therefore, this method does not provide any guarantee regarding
    delivery of events which only one of the groups is subscribed to.

    For example, if only group B was subscribed to FAN_OPEN_EXEC and only
    group A is subscribing only to FAN_OPEN, an execution of a binary file
    may not deliver FAN_OPEN_EXEC to group B nor FAN_OPEN to group A.
---

The second is to overlap the fsnotify_mark_srcu read side with
sb_start_write() for pre modify permission events [4], as described in
the commit message:
---
    fsnotify: acquire sb write access inside pre modify permission event

    For pre modify permission events, acquire sb write access before
    leaving SRCU and return >0 to signal that sb write access was acquired.

    This can be used to implement safe "handover" of pre modify permission
    events between two fanotify groups:

    - First, group A subscribes to pre modify events with FAN_MARK_SYNC
    - Then, a freeze/thaw cycle is performed on the filesystem
    - Finally, group B unsubscribes from those events

    This method guarantees that a pre modify event that both groups
    subscribed to will be delivered to either group or to both of them.

    In case the pre modify event is delivered only to group B, the
    freeze/thaw cycle guarantees that the filesystem modification that
    followed that pre modify event has also completed before the handover
    is complete and group B can be closed.

    For the pre rename permission event, acquire sb write access after the
    second event of the pair (i.e. rename to) has been authorized.
---
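
The freeze/thaw step itself only needs the existing FIFREEZE/FITHAW
ioctls; roughly (CAP_SYS_ADMIN is required and error handling is omitted):

#include <sys/ioctl.h>
#include <linux/fs.h>   /* FIFREEZE, FITHAW */
#include <fcntl.h>
#include <unistd.h>

/*
 * Freeze and immediately thaw the filesystem at mnt, so that any
 * modification whose pre modify event was delivered only to the old
 * group has fully completed (sb write access dropped) before the
 * handover finishes.
 */
static int freeze_thaw(const char *mnt)
{
        int fd = open(mnt, O_RDONLY | O_DIRECTORY);

        if (fd < 0)
                return -1;
        if (ioctl(fd, FIFREEZE, 0) == 0)
                ioctl(fd, FITHAW, 0);
        close(fd);
        return 0;
}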

What do you think about this handover technique?
Do you think that it is workable or do you see any major flaws in it?
Would you use a different or an additional synchronization primitive
instead of abusing fsnotify_mark_srcu?

To clarify, the race that I am trying to avoid is:
1. group B got a pre modify event and recorded the change before time T
2. The actual modification is performed after time T
3. group A does not get a pre modify event, so does not record the change
    in the checkpoint since T

Thanks,
Amir.

[1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#invalidating-local-cache
[2] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#tracking-local-modifications
[3] https://github.com/amir73il/linux/commits/fan_mark_sync
[4] https://github.com/amir73il/linux/commits/fan_modify_perm

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-10-13 12:16                             ` Amir Goldstein
@ 2022-11-03 12:57                               ` Jan Kara
  2022-11-03 13:38                                 ` Amir Goldstein
  0 siblings, 1 reply; 43+ messages in thread
From: Jan Kara @ 2022-11-03 12:57 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Dave Chinner

I'm sorry for the really delayed response. We had an internal conference
and some things around it that kept me busy.

On Thu 13-10-22 15:16:25, Amir Goldstein wrote:
> On Wed, Oct 12, 2022 at 7:28 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Wed, Oct 12, 2022 at 6:44 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > Hi Amir!
> > >
> > > On Fri 07-10-22 16:58:21, Amir Goldstein wrote:
> > > > > > The other use case of automatic inode marks I was thinking about,
> > > > > > which are even more relevant for $SUBJECT is this:
> > > > > > When instantiating a dentry with an inode that has xattr
> > > > > > "security.fanotify.mask" (a.k.a. persistent inode mark), an inode
> > > > > > mark could be auto created and attached to a group with a special sb
> > > > > > mark (we can limit a single special mark per sb).
> > > > > > This could be implemented similar to get_acl(), where i_fsnotify_mask
> > > > > > is always initialized with a special value (i.e. FS_UNINITIALIZED)
> > > > > > which is set to either 0 or non-zero if "security.fanotify.mask" exists.
> > > > > >
> > > > > > The details of how such an API would look like are very unclear to me,
> > > > > > so I will try to focus on the recursive auto inode mark idea.
> > > > >
> > > > > Yeah, although initializing fanotify marks based on xattrs does not look
> > > > > completely crazy I can see a lot of open questions there so I think
> > > > > automatic inode mark idea has more chances for success at this point :).
> > > >
> > > > I realized that there is one sort of "persistent mark" who raises
> > > > less questions - one that only has an ignore mask.
> > > >
> > > > ignore masks can have a "static" namespace that is not bound to any
> > > > specific group, but rather a set of groups that join this namespace.
> > > >
> > > > I played with this idea and wrote some patches:
> > > > https://github.com/amir73il/linux/commits/fan_xattr_ignore_mask
> > >
> > > I have glanced over the patches. In general the idea looks OK to me but I
> > > have some concerns:
> > >
> > > 1) Technically, it may be challenging to call into filesystem xattr
> > > handling code on first event generated by the inode - that may generate
> > > some unexpected lock recursion for some filesystems and some events which
> > > trigger the initialization...
> >
> > That may be a correct statement in general, but please note that
> > - Only permission events trigger auto-init of xattr ignore mask
> > - Permission events are called from security hooks
> > - Security hooks may also call getxattr to get the security context
> >
> > Perhaps LSMs always initialize the security context in OPEN and
> > never in ACCESS?
> >
> > One of the earlier versions of the patch initialized xattr ignore mask
> > on LOOKUP permission event, if ANY object was interested in ANY
> > permission event even if no object was interested in LOOKUP
> > to mimic the LSM context initialization,
> > but it was complicated and I wasn't sure if this was necessary.
> >
> 
> Also, permission events can sleep by definition
> so why would getxattr not be safe in the
> context of permission events handling?

Well, I'm not afraid of sleeping. I was more worried about lock ordering
issues. But you are right that this probably isn't going to be an issue.

> > > 2) What if you set the xattr while the group is already listening to
> > > events? Currently the change will get ignored, won't it? But I guess this
> > > could be handled by clearing the "cached" flag when the xattr is set.
> > >
> >
> > I have created an API to update the xattr via
> >   fanotify_mark(FAN_MARK_XATTR, ...
> > which updates the cached ignore mask in the connector.
> >
> > I see no reason to support "direct" modifications of this xattr.
> > If such changes are made directly it is fine to ignore them.
> >
> > > 3) What if multiple applications want to use the persistent mark
> > > functionality? I think we need some way to associate a particular
> > > fanotify group with a particular subset of fanotify xattrs so that
> > > coexistence of multiple applications is possible...
> > >
> >
> > Yeh, I thought about this as well.
> > The API in the patches is quite naive because it implements a single
> > global namespace for xattr ignore masks, but mostly I wanted to
> > see if I can get the fast path and auto-init implementation done.
> >
> > I was generally thinking of ioctl() as the API to join an xattr marks
> > namespace and negotiate the on-disk format of persistent marks
> > supported by the application.
> >
> > I would not want to allow multiple fanotify xattrs per inode -
> > that could have the consequence of the inode becoming a junkyard.
> >
> > I'd prefer to have a single xattr (say security.fanotify.mark)
> > and that mark will have
> > - on-disk format version
> > - namespace id
> > - ignore mask
> > - etc
> >
> > If multiple applications want to use persistent marks they need to figure
> > out how to work together without stepping on each other's toes.
> > I don't think it is up to fanotify to coordinate that.

I'm not sure if this really scales. Imagine you have, say, your backup
application that wants to use persistent marks, and then you have your HSM
application wanting to do the same. Or you have some daemon caching
preparsed contents of a config/ directory and watching whether it needs to
rescan the dir and rebuild the cache using persistent marks (I've been
hearing requests like these for persistent marks from desktop people for
many years). How exactly are these going to coordinate?

I fully understand your concern about the clutter in inode xattrs but if
we're going to limit the kernel to support only one persistent marks user,
then IMO we also need to provide a userspace counterpart (in the form of
some service or library like persistent change notification journal) that
will handle the coordination. Because otherwise it will become a mess
rather quickly.

> > fanotify_mark() can fail with EEXIST when a group that joined namespace A
> > is trying to update a persistent mark when a persistent mark of namespace B
> > already exists and probably some FAN_MARK_REPLACE flag could be used
> > to force overwrite the existing persistent mark.
> 
> One thing that I feel a bit silly about is something that I only fully
> noticed after writing this WIP wiki entry:
> https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#persistent-inode-marks
> 
> Persistent marks (in xattr) with ignore mask are useful, but only a
> little bit more useful than Evictable marks with ignore mask.
> 
> Persistent marks (in xattr) with a "positive" event mask are the real deal.
> Because with "positive" persistent marks, we will be able to optimize away
> srcu_read_lock() and marks iteration for the majority of fsnotify hooks
> by looking at objects interest masks (including the FS_XATTR_CACHED bit).
> 
> The good thing about the POC patches [1] is that there is no technical
> limitation in this code for using persistent marks with positive event
> masks.  It is a bit more challenging to document the fact that a "normal"
> fs/mount mark is needed in order to "activate" the persistent marks in
> the inodes of this fs/mount, but the changes to the code to support that
> would be minimal.

I agree persistent positive marks are very interesting (in fact I've heard
requests for functionality like this about 15 years ago :)). But if you'd
like to use them e.g. for backup or HSM then you need to somehow make sure
you didn't miss any events before you created the activation mark? That
looks like a somewhat unpleasant race with possible data (backup) corruption
consequences?

> [1] https://github.com/amir73il/linux/commits/fan_xattr_ignore_mask

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-11-03 12:57                               ` Jan Kara
@ 2022-11-03 13:38                                 ` Amir Goldstein
  0 siblings, 0 replies; 43+ messages in thread
From: Amir Goldstein @ 2022-11-03 13:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner

On Thu, Nov 3, 2022 at 2:57 PM Jan Kara <jack@suse.cz> wrote:
>
> I'm sorry for the really delayed response. We had an internal conference
> and some stuff around that which made me busy.
>

No problem at all.
There is a lot to process in this thread.
My follow-up email about avoiding TOCTOU is even more to digest...
I am happy for whatever feedback you can provide when you have the time.

> On Thu 13-10-22 15:16:25, Amir Goldstein wrote:
> > On Wed, Oct 12, 2022 at 7:28 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > On Wed, Oct 12, 2022 at 6:44 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > Hi Amir!
> > > >
> > > > On Fri 07-10-22 16:58:21, Amir Goldstein wrote:
> > > > > > > The other use case of automatic inode marks I was thinking about,
> > > > > > > which are even more relevant for $SUBJECT is this:
> > > > > > > When instantiating a dentry with an inode that has xattr
> > > > > > > "security.fanotify.mask" (a.k.a. persistent inode mark), an inode
> > > > > > > mark could be auto created and attached to a group with a special sb
> > > > > > > mark (we can limit a single special mark per sb).
> > > > > > > This could be implemented similar to get_acl(), where i_fsnotify_mask
> > > > > > > is always initialized with a special value (i.e. FS_UNINITIALIZED)
> > > > > > > which is set to either 0 or non-zero if "security.fanotify.mask" exists.
> > > > > > >
> > > > > > > The details of how such an API would look like are very unclear to me,
> > > > > > > so I will try to focus on the recursive auto inode mark idea.
> > > > > >
> > > > > > Yeah, although initializing fanotify marks based on xattrs does not look
> > > > > > completely crazy I can see a lot of open questions there so I think
> > > > > > automatic inode mark idea has more chances for success at this point :).
> > > > >
> > > > > I realized that there is one sort of "persistent mark" who raises
> > > > > less questions - one that only has an ignore mask.
> > > > >
> > > > > ignore masks can have a "static" namespace that is not bound to any
> > > > > specific group, but rather a set of groups that join this namespace.
> > > > >
> > > > > I played with this idea and wrote some patches:
> > > > > https://github.com/amir73il/linux/commits/fan_xattr_ignore_mask
> > > >
> > > > I have glanced over the patches. In general the idea looks OK to me but I
> > > > have some concerns:
> > > >
> > > > 1) Technically, it may be challenging to call into filesystem xattr
> > > > handling code on first event generated by the inode - that may generate
> > > > some unexpected lock recursion for some filesystems and some events which
> > > > trigger the initialization...
> > >
> > > That may be a correct statement in general, but please note that
> > > - Only permission events trigger auto-init of xattr ignore mask
> > > - Permission events are called from security hooks
> > > - Security hooks may also call getxattr to get the security context
> > >
> > > Perhaps LSMs always initialize the security context in OPEN and
> > > never in ACCESS?
> > >
> > > One of the earlier versions of the patch initialized xattr ignore mask
> > > on LOOKUP permission event, if ANY object was interested in ANY
> > > permission event even if no object was interested in LOOKUP
> > > to mimic the LSM context initialization,
> > > but it was complicated and I wasn't sure if this was necessary.
> > >
> >
> > Also, permission events can sleep by definition
> > so why would getxattr not be safe in the
> > context of permission events handling?
>
> Well, I'm not afraid of sleeping. I was more worried about lock ordering
> issues. But are right that this probably isn't going to be an issue.
>
> > > > 2) What if you set the xattr while the group is already listening to
> > > > events? Currently the change will get ignored, won't it? But I guess this
> > > > could be handled by clearing the "cached" flag when the xattr is set.
> > > >
> > >
> > > I have created an API to update the xattr via
> > >   fanotify_mark(FAN_MARK_XATTR, ...
> > > which updates the cached ignore mask in the connector.
> > >
> > > I see no reason to support "direct" modifications of this xattr.
> > > If such changes are made directly it is fine to ignore them.
> > >
> > > > 3) What if multiple applications want to use the persistent mark
> > > > functionality? I think we need some way to associate a particular
> > > > fanotify group with a particular subset of fanotify xattrs so that
> > > > coexistence of multiple applications is possible...
> > > >
> > >
> > > Yeh, I thought about this as well.
> > > The API in the patches is quite naive because it implements a single
> > > global namespace for xattr ignore masks, but mostly I wanted to
> > > see if I can get the fast path and auto-init implementation done.
> > >
> > > I was generally thinking of ioctl() as the API to join an xattr marks
> > > namespace and negotiate the on-disk format of persistent marks
> > > supported by the application.
> > >
> > > I would not want to allow multiple fanotify xattrs per inode -
> > > that could have the consequence of the inode becoming a junkyard.
> > >
> > > I'd prefer to have a single xattr (say security.fanotify.mark)
> > > and that mark will have
> > > - on-disk format version
> > > - namespace id
> > > - ignore mask
> > > - etc
> > >
> > > If multiple applications want to use persistent marks they need to figure
> > > out how to work together without stepping on each other's toes.
> > > I don't think it is up to fanotify to coordinate that.
>
> I'm not sure if this really scales. Imagine you have your say backup
> application that wants to use persistent marks and then you have your HSM
> application wanting to do the same. Or you have some daemon caching
> preparsed contents of config/ directory and watching whether it needs to
> rescan the dir and rebuild the cache using persistent marks (I'm hearing
> requests like these for persistent marks from desktop people for many
> years). How exactly are these going to coordinate?
>
> I fully understand your concern about the clutter in inode xattrs but if
> we're going to limit the kernel to support only one persistent marks user,
> then IMO we also need to provide a userspace counterpart (in the form of
> some service or library like persistent change notification journal) that
> will handle the coordination. Because otherwise it will become a mess
> rather quickly.
>

The concept of singleton userspace services exists, but getting their
UAPI right is a challenge.
We can draw inspiration from Windows, which is decades ahead of
Linux w.r.t. persistent fs notifications:
https://learn.microsoft.com/en-us/windows/win32/cfapi/build-a-cloud-file-sync-engine

IIUC, the UAPI allows a single CloudSync engine to register per fs (or
subtree root) to get the callbacks for files with the persistent marks
(called reparse points) in the "Windows.Storage.Provider" namespace.

IOW, you may have OneDrive content provider or GoogleDrive content provider
serving the persistent marks on a specific directory, never both.

But Windows does support different namespaces for reparse points,
so it's an example that supports both sides of our argument.

> > > fanotify_mark() can fail with EEXIST when a group that joined namespace A
> > > is trying to update a persistent mark when a persistent mark of namespace B
> > > already exists and probably some FAN_MARK_REPLACE flag could be used
> > > to force overwrite the existing persistent mark.
> >
> > One thing that I feel a bit silly about is something that I only fully
> > noticed after writing this WIP wiki entry:
> > https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#persistent-inode-marks
> >
> > Persistent marks (in xattr) with ignore mask are useful, but only a
> > little bit more useful than Evictable marks with ignore mask.
> >
> > Persistent marks (in xattr) with a "positive" event mask are the real deal.
> > Because with "positive" persistent marks, we will be able to optimize away
> > srcu_read_lock() and marks iteration for the majority of fsnotify hooks
> > by looking at objects interest masks (including the FS_XATTR_CACHED bit).
> >
> > The good thing about the POC patches [1] is that there is no technical
> > limitation in this code for using persistent marks with positive event
> > masks.  It is a bit more challenging to document the fact that a "normal"
> > fs/mount mark is needed in order to "activate" the persistent marks in
> > the inodes of this fs/mount, but the changes to the code to support that
> > would be minimal.
>
> I agree persistent positive marks are very interesting (in fact I've heard
> requests for functionality like this about 15 years ago :)). But if you'd
> like to use them e.g. for backup or HSM then you need to somehow make sure
> you didn't miss any events before you created the activation mark? That
> looks like a bit unpleasant race with possible data (backup) corruption
> consequences?
>

Yeah, positive persistent marks are quite futuristic at this point.
My POC will stick to a mount mark and inode ignore marks, where these
races are rather easy to avoid using FAN_MARK_SYNC (see next email).

Anyway, I have made a decision to aim for an initial HSM UAPI implementation
with evictable ignore marks (without persistent marks) because evictable marks
are functionally sufficient and persistent marks have too many UAPI
open questions.

I've modified the POC code to try persistent ignore marks
and fall back to evictable ignore marks, which incurs a few extra userspace
callbacks after a service restart or after dropping caches.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-10-28 12:50               ` Amir Goldstein
@ 2022-11-03 16:30                 ` Jan Kara
  2022-11-04  8:17                   ` Amir Goldstein
  0 siblings, 1 reply; 43+ messages in thread
From: Jan Kara @ 2022-11-03 16:30 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Dave Chinner

On Fri 28-10-22 15:50:04, Amir Goldstein wrote:
> On Thu, Sep 22, 2022 at 1:48 PM Jan Kara <jack@suse.cz> wrote:
> >
> > > Questions:
> > > - What do you think about the direction this POC has taken so far?
> > > - Is there anything specific that you would like to see in the POC
> > >   to be convinced that this API will be useful?
> >
> > I think your POC is taking a good direction and your discussion with Dave
> > had made me more confident that this is all workable :). I liked your idea
> > of the wiki (or whatever form of documentation) that summarizes what we've
> > discussed in this thread. That would be actually pretty nice for future
> > reference.
> >
> 
> The current state of POC is that "populate of access" of both files
> and directories is working and "race free evict of file content" is also
> implemented (safely AFAIK).
> 
> The technique involving exclusive write lease is discussed at [1].
> In a nutshell, populate and evict synchronize on atomic i_writecount
> and this technique can be implemented with upstream UAPIs.

Not so much i_writecount AFAIU but the generic lease mechanism overall. But
yes, the currently existing APIs should be enough for your purposes.

> I did use persistent xattr marks for the POC, but this is not a must.
> Evictable inode marks would have worked just as well.

OK.

> Now I have started to work on persistent change tracking.
> For this, I have only kernel code, only lightly tested, but I did not
> prove yet that the technique is working.
> 
> The idea that I started to sketch at [2] is to alternate between two groups.
> 
> When a change is recorded, an evictable ignore mark will be added on the
> object.  To start recording changes from a new point in time
> (checkpoint), a new group will be created (with no ignore marks) and the
> old group will be closed.

So what I dislike about the scheme with handover between two groups is that
it is somewhat complex, and furthermore requiring fs freezing for a checkpoint
is going to be rather expensive (and may be problematic if persistent
change tracking is used by potentially many unprivileged applications).

As a side note, I think it will be quite useful to be able to request a
checkpoint only for a subtree (e.g. some app may be interested only in a
particular subtree) and the scheme with two groups will make any
optimizations to benefit from such fact more difficult - either we create
new group without ignore marks and then have to re-record changes nobody
actually needs or we have to duplicate ignore marks which is potentially
expensive as well.

Let's think about the race:

> To clarify, the race that I am trying to avoid is:
> 1. group B got a pre modify event and recorded the change before time T
> 2. The actual modification is performed after time T
> 3. group A does not get a pre modify event, so does not record the change
>     in the checkpoint since T

AFAIU you are worried about:

Task T				Change journal		App

write(file)
  generate pre_modify event
				record 'file' as modified
							Request changes
							Records 'file' contents
  modify 'file' data

...
							Request changes
							Nothing changed but
App's view of 'file' is obsolete.

Can't we solve this by creating POST_WRITE async event and then use it like:

1) Set state to CHECKPOINT_PENDING
2) In state CHECKPOINT_PENDING we record all received modify events into a
   separate 'transition' stream.
3) Remove ignore marks we need to remove.
4) Switch to new period & clear CHECKPOINT_PENDING, all events are now
   recorded to the new period.
5) Merge all events from 'transition' stream to both old and new period
   event streams.
6) Events get removed from the 'transition' stream only once we receive
   POST_WRITE event corresponding to the PRE_WRITE event recorded there (or
   on crash recovery). This way some events from 'transition' stream may
   get merged to multiple period event streams if the checkpoints are
   frequent and writes take long.

This should avoid the above race, should be relatively lightweight, and
does not require major API extensions.
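
Very roughly, what I have in mind for the change journal daemon is
something like this (all the types and helpers below are made up,
PRE/POST_WRITE are the proposed events, and a single-threaded event
loop is assumed):

#include <stdbool.h>
#include <stdint.h>

struct stream;                          /* per-period list of modified files */

extern struct stream *transition, *old_period, *new_period, *current_period;

extern void stream_add(struct stream *s, uint64_t file_id);
extern void stream_remove(struct stream *s, uint64_t file_id);
extern void stream_merge(struct stream *from, struct stream *to);
extern void remove_ignore_marks(void);
extern void switch_period(void);        /* makes new_period the current one */

static bool checkpoint_pending;

void on_pre_write(uint64_t file_id)
{
        if (checkpoint_pending)
                stream_add(transition, file_id);        /* step 2 */
        else
                stream_add(current_period, file_id);
}

void on_post_write(uint64_t file_id)
{
        /* step 6: the write completed, no later period can miss it */
        stream_remove(transition, file_id);
}

void checkpoint(void)
{
        checkpoint_pending = true;                      /* step 1 */
        remove_ignore_marks();                          /* step 3 */
        switch_period();                                /* step 4 */
        checkpoint_pending = false;
        stream_merge(transition, old_period);           /* step 5 */
        stream_merge(transition, new_period);
}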

BTW, while thinking about this I was wondering: How are the applications
using persistent change journal going to deal with buffered vs direct IO? I
currently don't see a scheme that would not lose modifications for some
combinations...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-11-03 16:30                 ` Jan Kara
@ 2022-11-04  8:17                   ` Amir Goldstein
  2022-11-07 11:10                     ` Jan Kara
  0 siblings, 1 reply; 43+ messages in thread
From: Amir Goldstein @ 2022-11-04  8:17 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner

On Thu, Nov 3, 2022 at 6:30 PM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 28-10-22 15:50:04, Amir Goldstein wrote:
> > On Thu, Sep 22, 2022 at 1:48 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > > Questions:
> > > > - What do you think about the direction this POC has taken so far?
> > > > - Is there anything specific that you would like to see in the POC
> > > >   to be convinced that this API will be useful?
> > >
> > > I think your POC is taking a good direction and your discussion with Dave
> > > had made me more confident that this is all workable :). I liked your idea
> > > of the wiki (or whatever form of documentation) that summarizes what we've
> > > discussed in this thread. That would be actually pretty nice for future
> > > reference.
> > >
> >
> > The current state of POC is that "populate of access" of both files
> > and directories is working and "race free evict of file content" is also
> > implemented (safely AFAIK).
> >
> > The technique involving exclusive write lease is discussed at [1].
> > In a nutshell, populate and evict synchronize on atomic i_writecount
> > and this technique can be implemented with upstream UAPIs.
>
> Not so much i_writecount AFAIU but the generic lease mechanism overall. But
> yes, the currently existing APIs should be enough for your purposes.
>

Right. Do note that the write lease is not reliable enough by itself
to provide exclusive access to the content, because:
1. The lease break signal is delivered asynchronously to the content
    evict program
2. After the lease break timeout expires, a reader will get access
    to the content even if content eviction is in progress

The actual strong exclusive access is provided by the sequence:
1. Open file for write
2. Deny future FAN_OPEN_PERM
3. Take write lease, but just to verify that i_writecount == 1
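
In code, the eviction side of this sequence looks roughly like this
(existing UAPIs only; deny_opens is a stand-in for however the daemon's
FAN_OPEN_PERM handler decides to answer FAN_DENY, and the sketch assumes
the daemon's own open does not loop a blocking event back to itself):

#include <sys/fanotify.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdbool.h>

/* Consulted by the daemon's FAN_OPEN_PERM handler: while true, it
 * answers FAN_DENY for this file so no new opens can slip in. */
extern bool deny_opens;

/*
 * Grab exclusive access to the file's content before evicting it.
 * Returns an O_RDWR fd on success, -1 if someone else has the file open.
 */
static int evict_begin(const char *path)
{
        int fd = open(path, O_RDWR);            /* 1. open for write */

        if (fd < 0)
                return -1;

        deny_opens = true;                      /* 2. deny future FAN_OPEN_PERM */

        /*
         * 3. A write lease is only granted when we are the sole opener
         *    (the kernel checks i_writecount/readcount), so F_SETLEASE
         *    doubles as the "are we alone?" test.
         */
        if (fcntl(fd, F_SETLEASE, F_WRLCK) < 0) {
                deny_opens = false;
                close(fd);
                return -1;
        }
        return fd;
}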

Notice one odd thing: in do_dentry_open() the sequence is:
1. increment i_writecount
2. security_file_open() => FAN_OPEN_PERM
3. break_lease()

However, FAN_OPEN_PERM is blocking and when the listener
reads the event, you get to:
create_fd() => ... do_dentry_open(f_mode=FMODE_NONOTIFY):
4. may increment i_writecount
5. security_file_open() => FAN_OPEN_PERM skipped
6. break_lease() => send lease break signal

The result is a bit non-intuitive:

If a new open is attempted during content evict, the new open will be
blocked for the lease timeout, before the listener even gets a chance
to respond.

But if lease timeout has expired and the event listener denied the open,
the lease break signal will still be delivered to the content evict program,
despite the fact that the open is not going to proceed.

Nevertheless, as you say, the existing APIs work well for my purpose.

> > I did use persistent xattr marks for the POC, but this is not a must.
> > Evictable inode marks would have worked just as well.
>
> OK.
>
> > Now I have started to work on persistent change tracking.
> > For this, I have only kernel code, only lightly tested, but I did not
> > prove yet that the technique is working.
> >
> > The idea that I started to sketch at [2] is to alternate between two groups.
> >
> > When a change is recorded, an evictable ignore mark will be added on the
> > object.  To start recording changes from a new point in time
> > (checkpoint), a new group will be created (with no ignore marks) and the
> > old group will be closed.
>
> So what I dislike about the scheme with handover between two groups is that
> it is somewhat complex and furthermore requiring fs freezing for checkpoint
> is going to be rather expensive (and may be problematic if persistent
> change tracking is used by potentially many unpriviledged applications).
>
> As a side note I think it will be quite useful to be able to request
> checkpoint only for a subtree (e.g. some app may be interested only in a
> particular subtree) and the scheme with two groups will make any
> optimizations to benefit from such fact more difficult - either we create
> new group without ignore marks and then have to re-record changes nobody
> actually needs or we have to duplicate ignore marks which is potentially
> expensive as well.
>

For the record, checkpointing of a subtree is already requested by the
git-fsmonitor hook, already implemented by watchman using recursive
inotify watches and in-memory change tracking, and soon
to be implemented as a git built-in fsmonitor [1].

The known caveats are the need to do a "full crawl" after the fsmonitor
service starts or after an event queue overflow, and the pinning of too
many inodes in the cache.

Should we implement a systemd-fsmonitor service to serve the
git-fsmonitor hooks of all the users in the system, it would solve the
scalability issue of recursive inode marks by using fanotify
filesystem marks.

It is understandable that one would expect the fsmonitor system
service checkpoints to align with the local user checkpoints (i.e. git status).
But it doesn't have to be that way.

Imagine that the system change tracking service takes only
nightly persistent checkpoints with fsfreeze regardless of the
users git fsmonitor queries.

The service can continue to use async events and in-memory
change tracking to provide accurate subtree change queries,
but in case of overflow/restart, instead of falling back to "full crawl",
the queries will fall back to "changes since last persistent checkpoint".

For some of our customers "full crawl" can take weeks.
We have been using this scheme to address scalability of "full crawl"
since the first version of Overlayfs snapshots in 2017.

So our customers incur a maintenance penalty of fsfreeze during
idle times (weekly by default) and in return, the worst-case time for
a changed-files query was reduced from weeks to minutes.
It's all about making the right trade-offs ;-)

[1] https://lore.kernel.org/all/pull.1352.git.git.1665326258.gitgitgadget@gmail.com/

> Let's think about the race:
>
> > To clarify, the race that I am trying to avoid is:
> > 1. group B got a pre modify event and recorded the change before time T
> > 2. The actual modification is performed after time T
> > 3. group A does not get a pre modify event, so does not record the change
> >     in the checkpoint since T
>
> AFAIU you are worried about:
>
> Task T                          Change journal          App
>
> write(file)
>   generate pre_modify event
>                                 record 'file' as modified
>                                                         Request changes
>                                                         Records 'file' contents
>   modify 'file' data
>
> ...
>                                                         Request changes
>                                                         Nothing changed but
> App's view of 'file' is obsolete.
>
> Can't we solve this by creating POST_WRITE async event and then use it like:
>

I like the idea of using POST_WRITE instead of holding sb_writers.

> 1) Set state to CHECKPOINT_PENDING
> 2) In state CHECKPOINT_PENDING we record all received modify events into a
>    separate 'transition' stream.
> 3) Remove ignore marks we need to remove.

Our customer use cases may have many millions of dirs.
I don't think this solution will be scalable, which is why I use the
alternating groups to invalidate all the existing ignore marks at once.

But I agree that alternating groups should not be a requirement for HSM
and that for watching smaller subtrees, your suggestion makes more sense.

> 4) Switch to new period & clear CHECKPOINT_PENDING, all events are now
>    recorded to the new period.
> 5) Merge all events from 'transition' stream to both old and new period
>    event streams.
> 6) Events get removed from the 'transition' stream only once we receive
>    POST_WRITE event corresponding to the PRE_WRITE event recorded there (or
>    on crash recovery). This way some events from 'transition' stream may
>    get merged to multiple period event streams if the checkpoints are
>    frequent and writes take long.
>
> This should avoid the above race, should be relatively lightweight, and
> does not require major API extensions.
>

If I am not mistaken, CHECKPOINT_PENDING vs. alternating groups
is an implementation detail for the HSM.

PRE_WRITE/POST_WRITE and FAN_MARK_SYNC APIs are needed
for both implementations (the single group scheme needs to flush all
ignore marks with FAN_MARK_SYNC).

I am going to try to implement the PRE/POST_WRITE events, and for the
POC I may start with a single group because it may be easier, or I may
implement both schemes; we'll see.

> BTW, while thinking about this I was wondering: How are the applications
> using persistent change journal going to deal with buffered vs direct IO? I
> currently don't see a scheme that would not loose modifications for some
> combinations...
>

This question circles back to the nfsd discussion about when to update
i_version and when to update mtime: before the change is observed
in core? On disk? Both before and after?

My only answer to that is a hybrid combination of using async
events to track in-core changes and persistent change tracking
for on-disk changes.

fsfreeze before completion of the checkpoint is optional, but if
fsfreeze is performed then it should answer your question about
buffered vs direct IO and it makes the questions about in-memory
vs. on-disk moot.

Or maybe I did not understand the problem with buffered vs direct IO?
Can you give an example of a lost modification even when fsfreeze is
performed for completion of the checkpoint?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-11-04  8:17                   ` Amir Goldstein
@ 2022-11-07 11:10                     ` Jan Kara
  2022-11-07 14:13                       ` Amir Goldstein
  0 siblings, 1 reply; 43+ messages in thread
From: Jan Kara @ 2022-11-07 11:10 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Dave Chinner

On Fri 04-11-22 10:17:54, Amir Goldstein wrote:
> On Thu, Nov 3, 2022 at 6:30 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Fri 28-10-22 15:50:04, Amir Goldstein wrote:
> > > On Thu, Sep 22, 2022 at 1:48 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > > Questions:
> > > > > - What do you think about the direction this POC has taken so far?
> > > > > - Is there anything specific that you would like to see in the POC
> > > > >   to be convinced that this API will be useful?
> > > >
> > > > I think your POC is taking a good direction and your discussion with Dave
> > > > had made me more confident that this is all workable :). I liked your idea
> > > > of the wiki (or whatever form of documentation) that summarizes what we've
> > > > discussed in this thread. That would be actually pretty nice for future
> > > > reference.
> > > >
> > >
> > > The current state of POC is that "populate of access" of both files
> > > and directories is working and "race free evict of file content" is also
> > > implemented (safely AFAIK).
> > >
> > > The technique involving exclusive write lease is discussed at [1].
> > > In a nutshell, populate and evict synchronize on atomic i_writecount
> > > and this technique can be implemented with upstream UAPIs.
> >
> > Not so much i_writecount AFAIU but the generic lease mechanism overall. But
> > yes, the currently existing APIs should be enough for your purposes.
> >
> 
> Right. Do note that the write lease is not reliable enough by itself
> to provide exclusive access to the content, because:
> 1. The lease break signal is delivered asynchronously to content evict
>     program
> 2. After the lease break timeout expires, reader will get access
>     to the content even if content eviction is in progress
> 
> The actual strong exclusive access is provided by the sequence:
> 1. Open file for write
> 2. Deny future FAN_OPEN_PERM
> 3. Take write lease, but just to verify that i_writecount == 1
> 
> Notice one thing odd is that in do_dentry_open() the sequence is:
> 1. increment i_writecount
> 2. security_file_open() => FAN_OPEN_PERM
> 3. break_lease()
> 
> However, FAN_OPEN_PERM is blocking and when listener
> reads the event, you get to:
> created_fd() => ... do_dentry_open(f_mode=FMODE_NONOTIFY):
> 4. may increment i_writecount
> 5. security_file_open() => FAN_OPEN_PERM skipped
> 6. break_lease() => send lease break signal
> 
> The result is a bit non intuitive:
> 
> If a new open is attempted during content evict, the new open will be
> blocked for the lease timeout, before the listener even gets a chance
> to respond.
>
> But if lease timeout has expired and the event listener denied the open,
> the lease break signal will still be delivered to the content evict program,
> despite the fact that the open is not going to proceed.

I see. I'd just note that allowing FID mode for permission events would
solve both these problems, wouldn't it?

> > > I did use persistent xattr marks for the POC, but this is not a must.
> > > Evictable inode marks would have worked just as well.
> >
> > OK.
> >
> > > Now I have started to work on persistent change tracking.
> > > For this, I have only kernel code, only lightly tested, but I did not
> > > prove yet that the technique is working.
> > >
> > > The idea that I started to sketch at [2] is to alternate between two groups.
> > >
> > > When a change is recorded, an evictable ignore mark will be added on the
> > > object.  To start recording changes from a new point in time
> > > (checkpoint), a new group will be created (with no ignore marks) and the
> > > old group will be closed.
> >
> > So what I dislike about the scheme with handover between two groups is that
> > it is somewhat complex and furthermore requiring fs freezing for checkpoint
> > is going to be rather expensive (and may be problematic if persistent
> > change tracking is used by potentially many unpriviledged applications).
> >
> > As a side note I think it will be quite useful to be able to request
> > checkpoint only for a subtree (e.g. some app may be interested only in a
> > particular subtree) and the scheme with two groups will make any
> > optimizations to benefit from such fact more difficult - either we create
> > new group without ignore marks and then have to re-record changes nobody
> > actually needs or we have to duplicate ignore marks which is potentially
> > expensive as well.
> >
> 
> For the records, checkpoint of a subtree is already requested by
> git-fsmonitor hook and already implemented using inotify recursive
> watches and in-memory change tracking by watchman and soon
> to be implemented as a git built-in fsmonitor [1].
> 
> The known caveats are the need to do a "full crawl" after fsmonitor
> service start or after event queue overflow and pinning of too many
> inodes to cache.
> 
> Should we implement a systemd-fsmonitor service to serve
> git-fsmonitor hooks of all the users in the system, it will solve the
> scalability issue of recursive inode marks by using fanotify
> filesystem marks.
> 
> It is understandable that one would expect the fsmonitor system
> service checkpoints to align with the local user checkpoints (i.e. git status).
> But this doesn't have to be this way.
> 
> Imagine that the system change tracking service takes only
> nightly persistent checkpoints with fsfreeze regardless of the
> users git fsmonitor queries.
> 
> The service can continue to use async events and in-memory
> change tracking to provide accurate subtree change queries,
> but in case of overflow/restart, instead of falling back to "full crawl",
> the queries will fall back to "changes since last persistent checkpoint".
> 
> For some of our customers "full crawl" can take weeks.
> We have been using this scheme to address scalability of "full crawl"
> since the first version of Overlayfs snapshots in 2017.
> 
> So our customers incur a maintenance penalty of fsfreeze during
> idle times (weekly by default) and in return, the worst case time for
> changed files query was reduced from weeks to minutes.
> It's all about making the right trade offs ;-)
> 
> [1] https://lore.kernel.org/all/pull.1352.git.git.1665326258.gitgitgadget@gmail.com/

Right. I just suspect that with systemd-fsmonitor, once more applications
start using it, it will be more difficult to tune this. E.g. for
git-status I can imagine daily snapshots will not be that useful (they won't
save that much time) and users would rather see something like hourly
snapshots or even more frequent ones, at which point things become somewhat
inconvenient and performance heavy.

So we don't have to implement this now but I'd prefer we keep this as a
possibility for a future API extension...

> > Let's think about the race:
> >
> > > To clarify, the race that I am trying to avoid is:
> > > 1. group B got a pre modify event and recorded the change before time T
> > > 2. The actual modification is performed after time T
> > > 3. group A does not get a pre modify event, so does not record the change
> > >     in the checkpoint since T
> >
> > AFAIU you are worried about:
> >
> > Task T                          Change journal          App
> >
> > write(file)
> >   generate pre_modify event
> >                                 record 'file' as modified
> >                                                         Request changes
> >                                                         Records 'file' contents
> >   modify 'file' data
> >
> > ...
> >                                                         Request changes
> >                                                         Nothing changed but
> > App's view of 'file' is obsolete.
> >
> > Can't we solve this by creating POST_WRITE async event and then use it like:
> >
> 
> I like the idea of using POST_WRITE instead of holding sb_writers.
> 
> > 1) Set state to CHECKPOINT_PENDING
> > 2) In state CHECKPOINT_PENDING we record all received modify events into a
> >    separate 'transition' stream.
> > 3) Remove ignore marks we need to remove.
> 
> Our customer use cases may have many millions of dirs.
> I don't think this solution will be scalable, which is why I use the
> alternating groups to invalidate all the existing ignore marks at once.

I see. Well, we could also extend FAN_MARK_FLUSH so that you can just
remove all ignore marks from a group, so that you don't have to remove them
one-by-one and don't have to switch to a new group. In principle group
teardown does the same. It would allow large-scale as well as small-scale
users to use a very similar scheme with a single group for switching periods.
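
I imagine something like the following (a hypothetical flag combination;
current kernels would just reject it):

#include <sys/fanotify.h>
#include <fcntl.h>
#include <stddef.h>

/*
 * Hypothetical extension: flush only this group's ignore marks while
 * keeping its other marks in place.  Not supported by current kernels,
 * where FAN_MARK_FLUSH drops all marks of the given object type.
 */
static int flush_ignore_marks(int group_fd)
{
        return fanotify_mark(group_fd, FAN_MARK_FLUSH | FAN_MARK_IGNORE,
                             0, AT_FDCWD, NULL);
}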

> But I agree that alternating groups should not be a requirement for HSM
> and that for watching smaller subtrees, your suggestion makes more sense.
> 
> > 4) Switch to new period & clear CHECKPOINT_PENDING, all events are now
> >    recorded to the new period.
> > 5) Merge all events from 'transition' stream to both old and new period
> >    event streams.
> > 6) Events get removed from the 'transition' stream only once we receive
> >    POST_WRITE event corresponding to the PRE_WRITE event recorded there (or
> >    on crash recovery). This way some events from 'transition' stream may
> >    get merged to multiple period event streams if the checkpoints are
> >    frequent and writes take long.
> >
> > This should avoid the above race, should be relatively lightweight, and
> > does not require major API extensions.
> >
> 
> If I am not mistaken, CHECKPOINT_PENDING vs. alternating groups
> is an implementation detail for the HSM.
>
> PRE_WRITE/POST_WRITE and FAN_MARK_SYNC APIs are needed
> for both the implementations (single group scheme needs to flush all
> ignore marks with FAN_MARK_SYNC).

So why would FAN_MARK_SYNC be needed for the single group scheme? From the
kernel POV the scheme I have proposed does not require any new API changes
besides the POST_WRITE event AFAICT, and possibly a FAN_MARK_FLUSH tweak for
more efficient removal of ignore marks. We don't even need the filesystem
freezing (modulo the buffered vs direct IO discussion below).

> I am going to try to implement the PRE/POST_WRITE events and for
> POC I may start with a single group because it may be easier or I may
> implement both schemes, we'll see.

That would be great. Thank you. Perhaps I'm missing something in my mental
model which would make things impractical :)

> > BTW, while thinking about this I was wondering: How are the applications
> > using persistent change journal going to deal with buffered vs direct IO? I
> > currently don't see a scheme that would not loose modifications for some
> > combinations...
> 
> This question is circling back to nfsd discussion about when to update
> i_version and when to update mtime, before the change is observed
> in core? on disk? both before and after?
> 
> My only answer to that is a hybrid combination of using async
> events to track in-core changes and persistent change tracking
> for on-disk changes.
> 
> fsfreeze before completion of the checkpoint is optional, but if
> fsfreeze is performed then it should answer your question about
> buffered vs direct IO and it makes the questions about in-memory
> vs. on-disk moot.
> 
> Or maybe I did not understand the problem with buffered vs direct IO?
> Can you give an example for losing modification that does involve
> fsfreeze for completion of the checkpoint?

You're right that fsfreeze during the period transition also solves the issue
with buffered vs direct IO and I'm not aware of any other reliable solution.
It just seems like a rather big (and expensive) hammer. Probably fine for
certain applications (such as a daily system backup) but not so much for
others (such as each of your desktop apps wanting to efficiently check
for changes in their config directories on startup). So probably there's a
market for both solutions, with and without fsfreeze ;)

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-11-07 11:10                     ` Jan Kara
@ 2022-11-07 14:13                       ` Amir Goldstein
  2022-11-14 19:17                         ` Jan Kara
  0 siblings, 1 reply; 43+ messages in thread
From: Amir Goldstein @ 2022-11-07 14:13 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner

On Mon, Nov 7, 2022 at 1:10 PM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 04-11-22 10:17:54, Amir Goldstein wrote:
> > On Thu, Nov 3, 2022 at 6:30 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Fri 28-10-22 15:50:04, Amir Goldstein wrote:
> > > > On Thu, Sep 22, 2022 at 1:48 PM Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > > Questions:
> > > > > > - What do you think about the direction this POC has taken so far?
> > > > > > - Is there anything specific that you would like to see in the POC
> > > > > >   to be convinced that this API will be useful?
> > > > >
> > > > > I think your POC is taking a good direction and your discussion with Dave
> > > > > had made me more confident that this is all workable :). I liked your idea
> > > > > of the wiki (or whatever form of documentation) that summarizes what we've
> > > > > discussed in this thread. That would be actually pretty nice for future
> > > > > reference.
> > > > >
> > > >
> > > > The current state of POC is that "populate of access" of both files
> > > > and directories is working and "race free evict of file content" is also
> > > > implemented (safely AFAIK).
> > > >
> > > > The technique involving exclusive write lease is discussed at [1].
> > > > In a nutshell, populate and evict synchronize on atomic i_writecount
> > > > and this technique can be implemented with upstream UAPIs.
> > >
> > > Not so much i_writecount AFAIU but the generic lease mechanism overall. But
> > > yes, the currently existing APIs should be enough for your purposes.
> > >
> >
> > Right. Do note that the write lease is not reliable enough by itself
> > to provide exclusive access to the content, because:
> > 1. The lease break signal is delivered asynchronously to content evict
> >     program
> > 2. After the lease break timeout expires, reader will get access
> >     to the content even if content eviction is in progress
> >
> > The actual strong exclusive access is provided by the sequence:
> > 1. Open file for write
> > 2. Deny future FAN_OPEN_PERM
> > 3. Take write lease, but just to verify that i_writecount == 1
> >
> > Notice one thing odd is that in do_dentry_open() the sequence is:
> > 1. increment i_writecount
> > 2. security_file_open() => FAN_OPEN_PERM
> > 3. break_lease()
> >
> > However, FAN_OPEN_PERM is blocking and when listener
> > reads the event, you get to:
> > created_fd() => ... do_dentry_open(f_mode=FMODE_NONOTIFY):
> > 4. may increment i_writecount
> > 5. security_file_open() => FAN_OPEN_PERM skipped
> > 6. break_lease() => send lease break signal
> >
> > The result is a bit non intuitive:
> >
> > If a new open is attempted during content evict, the new open will be
> > blocked for the lease timeout, before the listener even gets a chance
> > to respond.
> >
> > But if lease timeout has expired and the event listener denied the open,
> > the lease break signal will still be delivered to the content evict program,
> > despite the fact that the open is not going to proceed.
>
> I see. I'd just note that allowing FID mode for permission events would
> solve both these problems, won't it?
>

Yes, it would.
event_f_flags = O_PATH also does not have this wrinkle.

Originally, I had considered using FIDs in the permission events,
but I realized the event->fd is used as a key to the permission response,
so I would have had to replace event->fd with some response cookie.

Anyway, this oddity is not a problem for me now.
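
For reference, this is how a response is keyed today with the existing
UAPI (the fd from the event metadata is echoed back to the kernel):

#include <sys/fanotify.h>
#include <unistd.h>

/*
 * Answer a permission event: the kernel matches the response to the
 * pending event by the fd it handed out in the event metadata, which
 * is why event->fd currently doubles as the response key.
 */
static void respond(int group_fd, const struct fanotify_event_metadata *ev,
                    unsigned int decision /* FAN_ALLOW or FAN_DENY */)
{
        struct fanotify_response rsp = {
                .fd = ev->fd,
                .response = decision,
        };

        write(group_fd, &rsp, sizeof(rsp));
}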

[...]

> > > Let's think about the race:
> > >
> > > > To clarify, the race that I am trying to avoid is:
> > > > 1. group B got a pre modify event and recorded the change before time T
> > > > 2. The actual modification is performed after time T
> > > > 3. group A does not get a pre modify event, so does not record the change
> > > >     in the checkpoint since T
> > >
> > > AFAIU you are worried about:
> > >
> > > Task T                          Change journal          App
> > >
> > > write(file)
> > >   generate pre_modify event
> > >                                 record 'file' as modified
> > >                                                         Request changes
> > >                                                         Records 'file' contents
> > >   modify 'file' data
> > >
> > > ...
> > >                                                         Request changes
> > >                                                         Nothing changed but
> > > App's view of 'file' is obsolete.
> > >
> > > Can't we solve this by creating POST_WRITE async event and then use it like:
> > >
> >
> > I like the idea of using POST_WRITE instead of holding sb_writers.
> >
> > > 1) Set state to CHECKPOINT_PENDING
> > > 2) In state CHECKPOINT_PENDING we record all received modify events into a
> > >    separate 'transition' stream.
> > > 3) Remove ignore marks we need to remove.
> >
> > Our customer use cases may have many millions of dirs.
> > I don't think this solution will be scalable, which is why I use the
> > alternating groups to invalidate all the existing ignore marks at once.
>
> I see. Well, we could also extend FAN_MARK_FLUSH so that you can just
> remove all ignore marks from a group so that you don't have to remove them
> one-by-one and don't have to switch to a new group. In principle group
> teardown does the same. It would allow large scale as well as small scale
> users use very similar scheme with single group for switching periods.
>

Maybe so, I need to try it to see if it can scale.

But note that in the alternating group scheme we do NOT need to wait for
the old group teardown to complete before returning the results of the query:
1. group T+1 subscribes to some events with FAN_MARK_SYNC
2. group T unsubscribes from those events

Step 2 can be done in the background.
The query which returns all the files modified between checkpoints T..T+1
can already return the changes recorded by group T while group T is
shutting down and cleaning up all the evictable marks.

> > But I agree that alternating groups should not be a requirement for HSM
> > and that for watching smaller subtrees, your suggestion makes more sense.
> >
> > > 4) Switch to new period & clear CHECKPOINT_PENDING, all events are now
> > >    recorded to the new period.
> > > 5) Merge all events from 'transition' stream to both old and new period
> > >    event streams.
> > > 6) Events get removed from the 'transition' stream only once we receive
> > >    POST_WRITE event corresponding to the PRE_WRITE event recorded there (or
> > >    on crash recovery). This way some events from 'transition' stream may
> > >    get merged to multiple period event streams if the checkpoints are
> > >    frequent and writes take long.
> > >
> > > This should avoid the above race, should be relatively lightweight, and
> > > does not require major API extensions.
> > >
> >
> > If I am not mistaken, CHECKPOINT_PENDING vs. alternating groups
> > is an implementation detail for the HSM.
> >
> > PRE_WRITE/POST_WRITE and FAN_MARK_SYNC APIs are needed
> > for both the implementations (single group scheme needs to flush all
> > ignore marks with FAN_MARK_SYNC).
>
> So why would be FAN_MARK_SYNC needed for the single group scheme? From the
> kernel POV the scheme I have proposed does not require any new API changes
> besides the POST_WRITE event AFAICT. And possibly FAN_MARK_FLUSH tweak for
> more efficient removal of ignore marks. We don't even need the filesystem
> freezing (modulo the buffered vs direct IO discussion below).
>

Maybe I'm wrong, but my understanding is that after:
3) Remove ignore marks we need to remove.
a PRE_WRITE event may still be in send_to_group()
with one of the "removed" ignore marks and be ignored.

So it is not safe to:
4) Switch to new period & clear CHECKPOINT_PENDING

My understanding is that
synchronize_srcu(&fsnotify_mark_srcu);
is needed as a barrier between 3) and 4).

In any case, even if CHECKPOINT_PENDING can work,
with or without FAN_MARK_SYNC, to me personally, understanding
the proof of correctness of the alternating groups model is very easy,
while proving correctness for the CHECKPOINT_PENDING model is
something that I have not yet been able to accomplish.

> > I am going to try to implement the PRE/POST_WRITE events and for
> > POC I may start with a single group because it may be easier or I may
> > implement both schemes, we'll see.
>
> That would be great. Thank you. Perhaps I'm missing something in my mental
> model which would make things impractical :)
>

Me too. I won't know before I try.

FYI, at the moment I am considering not allowing independent
subscription to PRE/POST_XXX events; instead, a group with class
FAN_CLASS_VFS_FILTER will subscribe to XXX events, will get both
PRE_XXX and POST_XXX for them, and won't be able to subscribe
to XXX events that do not have PRE_XXX counterparts.

The rationale is that if a group subscribes to either PRE_XXX or POST_XXX,
the XXX() operation is not going to be on the fast path anyway.

This will make it easier to support more PRE/POST_XXX events
without using up all of the remaining 32-bit namespace.

Using the high 32 bits of the mask for PRE events and folding them
into the object interest mask with the low 32 bits is another thing
that I was considering in case you would prefer to allow
independent subscription for PRE/POST_XXX events.
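
To illustrate the second option, the encoding could look something like
this (purely hypothetical, none of these macros exist, and the UAPI mask
would have to grow to 64 bits):

#include <sys/fanotify.h>
#include <linux/types.h>

/*
 * Hypothetical encoding: a PRE_XXX bit lives in the high 32 bits,
 * mirroring the existing XXX bit in the low 32 bits, so folding the
 * two halves together yields the object interest mask.
 */
#define FAN_PRE_SHIFT           32
#define FAN_PRE_ACCESS          ((__u64)FAN_ACCESS << FAN_PRE_SHIFT)
#define FAN_PRE_MODIFY          ((__u64)FAN_MODIFY << FAN_PRE_SHIFT)

#define FAN_PRE_EVENTS(mask)    ((__u32)((mask) >> FAN_PRE_SHIFT))
#define FAN_FOLDED_INTEREST(mask) ((__u32)(mask) | FAN_PRE_EVENTS(mask))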

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-11-07 14:13                       ` Amir Goldstein
@ 2022-11-14 19:17                         ` Jan Kara
  2022-11-14 20:08                           ` Amir Goldstein
  0 siblings, 1 reply; 43+ messages in thread
From: Jan Kara @ 2022-11-14 19:17 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Dave Chinner

On Mon 07-11-22 16:13:37, Amir Goldstein wrote:
> On Mon, Nov 7, 2022 at 1:10 PM Jan Kara <jack@suse.cz> wrote:
> > > > Let's think about the race:
> > > >
> > > > > To clarify, the race that I am trying to avoid is:
> > > > > 1. group B got a pre modify event and recorded the change before time T
> > > > > 2. The actual modification is performed after time T
> > > > > 3. group A does not get a pre modify event, so does not record the change
> > > > >     in the checkpoint since T
> > > >
> > > > AFAIU you are worried about:
> > > >
> > > > Task T                          Change journal          App
> > > >
> > > > write(file)
> > > >   generate pre_modify event
> > > >                                 record 'file' as modified
> > > >                                                         Request changes
> > > >                                                         Records 'file' contents
> > > >   modify 'file' data
> > > >
> > > > ...
> > > >                                                         Request changes
> > > >                                                         Nothing changed but
> > > > App's view of 'file' is obsolete.
> > > >
> > > > Can't we solve this by creating POST_WRITE async event and then use it like:
> > > >
> > >
> > > I like the idea of using POST_WRITE instead of holding sb_writers.
> > >
> > > > 1) Set state to CHECKPOINT_PENDING
> > > > 2) In state CHECKPOINT_PENDING we record all received modify events into a
> > > >    separate 'transition' stream.
> > > > 3) Remove ignore marks we need to remove.
> > >
> > > Our customer use cases may have many millions of dirs.
> > > I don't think this solution will be scalable, which is why I use the
> > > alternating groups to invalidate all the existing ignore marks at once.
> >
> > I see. Well, we could also extend FAN_MARK_FLUSH so that you can just
> > remove all ignore marks from a group so that you don't have to remove them
> > one-by-one and don't have to switch to a new group. In principle group
> > teardown does the same. It would allow large scale as well as small scale
> > users use very similar scheme with single group for switching periods.
> >
> 
> Maybe so, I need to try it to see if it can scale.
> 
> But note that in the alternating group scheme we do NOT need to wait for
> the old group teardown to complete before returning the results of the query:
> 1. group T+1 subscribes to some events with FAN_MARK_SYNC
> 2. group T unsubscribes from those events
> 
> Step 2 can be done in the background.
> The query which returns all the files modified between checkpoints T..T+1
> can already return the changes recorded by group T while group T is
> shutting down and cleaning up all the evictable marks.

I agree. With my scheme with a single group we first need to remove the
ignore marks before we can report events for T and start collecting events
for T+1. Total amount of work is going to be the same but latency of
reporting is going to be higher. I'd just think that it would not be too bad
but it needs to be measured. Also from the scheme you've described the
declaration of a checkpoint didn't seem like a latency sensitive operation?

> > > But I agree that alternating groups should not be a requirement for HSM
> > > and that for watching smaller subtrees, your suggestion makes more sense.
> > >
> > > > 4) Switch to new period & clear CHECKPOINT_PENDING, all events are now
> > > >    recorded to the new period.
> > > > 5) Merge all events from 'transition' stream to both old and new period
> > > >    event streams.
> > > > 6) Events get removed from the 'transition' stream only once we receive
> > > >    POST_WRITE event corresponding to the PRE_WRITE event recorded there (or
> > > >    on crash recovery). This way some events from 'transition' stream may
> > > >    get merged to multiple period event streams if the checkpoints are
> > > >    frequent and writes take long.
> > > >
> > > > This should avoid the above race, should be relatively lightweight, and
> > > > does not require major API extensions.
> > > >
> > >
> > > If I am not mistaken, CHECKPOINT_PENDING vs. alternating groups
> > > is an implementation detail for the HSM.
> > >
> > > PRE_WRITE/POST_WRITE and FAN_MARK_SYNC APIs are needed
> > > for both the implementations (single group scheme needs to flush all
> > > ignore marks with FAN_MARK_SYNC).
> >
> > So why would be FAN_MARK_SYNC needed for the single group scheme? From the
> > kernel POV the scheme I have proposed does not require any new API changes
> > besides the POST_WRITE event AFAICT. And possibly FAN_MARK_FLUSH tweak for
> > more efficient removal of ignore marks. We don't even need the filesystem
> > freezing (modulo the buffered vs direct IO discussion below).
> >
> 
> Maybe I'm wrong, but my understanding is that after:
> 3) Remove ignore marks we need to remove.
> a PRE_WRITE event may still be in send_to_group()
> with one of the "removed" ignore marks and be ignored.
> 
> So it is not safe to:
> 4) Switch to new period & clear CHECKPOINT_PENDING

Well, but we'd still get the POST_WRITE event, evaluate this as a write
straddling checkpoint and include the file into the set of changed files
for checkpoint T+1 or later. So I don't think synchronize_srcu() is needed
anywhere?

> My understanding is that
> synchronize_srcu(&fsnotify_mark_srcu);
> is needed as barrier between 3) and 4)
> 
> In any case, even if CHECKPOINT_PENDING can work,
> with or without FAN_MARK_SYNC, to me personally, understanding
> the proof of correctness of alternating groups model is very easy,
> while proving correctness for CHECKPOINT_PENDING model is
> something that I was not yet able to accomplish.

I agree the scheme with CHECKPOINT_PENDING isn't easy to reason about but I
don't find your scheme with two groups simpler ;) Let me try to write down
rationale for my scheme, I think I can even somewhat simplify it:

Write operation consists of:
generate PRE_WRITE on F
modify data of F
generate POST_WRITE on F

Checkpoint consists of:
clear ignore marks
report files for which we received PRE_WRITE or POST_WRITE until this
moment

And the invariant we try to provide is: If file F was modified during
checkpoint T, then we report F as modified during T or some later
checkpoint. Where it is matter of quality of implementation that "some
later checkpoint" isn't too much later than T but it depends on the
frequency of checkpoints, the length of notification queue, etc. so it is
difficult to give hard guarantees.

And the argument why the scheme maintains the invariant is that if
POST_WRITE is generated after "clear ignore marks" finishes, it will get
delivered and thus F will get reported as modified in some checkpoint once
the event is processed. If POST_WRITE gets generated before "clear ignore
marks" finishes and F is among ignored inodes, it means F is already in
modified set so it will get reported as part of checkpoint T. Also
application will already see modified data when processing list of modified
files in checkpoint T.

Simple and we don't even need PRE_WRITE here. But maybe you wanted to
provide stronger invariant? Like "you are not able to see modified data
without seeing F as modified?" But what exactly would be a guarantee here?
I feel I'm missing something here but I cannot quite grasp it at this
moment...

> > > I am going to try to implement the PRE/POST_WRITE events and for
> > > POC I may start with a single group because it may be easier or I may
> > > implement both schemes, we'll see.
> >
> > That would be great. Thank you. Perhaps I'm missing something in my mental
> > model which would make things impractical :)
> >
> 
> Me too. I won't know before I try.
> 
> FYI, at the moment I am considering not allowing independent
> subscription for PRE/POST_XXX events, but only subscribe to
> XXX events and a group with class FAN_CLASS_VFS_FILTER
> will get both PRE/POST_XXX and won't be able to subscribe
> to XXX events that do not have PRE_XXX events.
> 
> The rationale is that if a group subscribes to either PRE/POST_XXX
> XXX() operation is not going to be on the fast path anyway.
> 
> This will make it easier to support more PRE/POST_XXX events
> without using up all the remaining 32bits namespace.
> 
> Using the high 32bits of mask for PRE events and folding them
> in the object interest mask with the low 32bit is another thing
> that I was considering in case you would prefer to allow
> independent subscription for PRE/POST_XXX events.

So one question: Do you see anything else than POST_WRITE as being useful?
For directory operations it seems pointless as they hold i_rwsem exclusively
so I don't see useful distinction between PRE and POST event. For OPEN and
CLOSE I don't see use either. ACCESS might be the only one where PRE and
POST would both be useful for something.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-11-14 19:17                         ` Jan Kara
@ 2022-11-14 20:08                           ` Amir Goldstein
  2022-11-15 10:16                             ` Jan Kara
  0 siblings, 1 reply; 43+ messages in thread
From: Amir Goldstein @ 2022-11-14 20:08 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner

On Mon, Nov 14, 2022 at 9:17 PM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 07-11-22 16:13:37, Amir Goldstein wrote:
> > On Mon, Nov 7, 2022 at 1:10 PM Jan Kara <jack@suse.cz> wrote:
> > > > > Let's think about the race:
> > > > >
> > > > > > To clarify, the race that I am trying to avoid is:
> > > > > > 1. group B got a pre modify event and recorded the change before time T
> > > > > > 2. The actual modification is performed after time T
> > > > > > 3. group A does not get a pre modify event, so does not record the change
> > > > > >     in the checkpoint since T
> > > > >
> > > > > AFAIU you are worried about:
> > > > >
> > > > > Task T                          Change journal          App
> > > > >
> > > > > write(file)
> > > > >   generate pre_modify event
> > > > >                                 record 'file' as modified
> > > > >                                                         Request changes
> > > > >                                                         Records 'file' contents
> > > > >   modify 'file' data
> > > > >
> > > > > ...
> > > > >                                                         Request changes
> > > > >                                                         Nothing changed but
> > > > > App's view of 'file' is obsolete.
> > > > >
> > > > > Can't we solve this by creating POST_WRITE async event and then use it like:
> > > > >
> > > >
> > > > I like the idea of using POST_WRITE instead of holding sb_writers.
> > > >
> > > > > 1) Set state to CHECKPOINT_PENDING
> > > > > 2) In state CHECKPOINT_PENDING we record all received modify events into a
> > > > >    separate 'transition' stream.
> > > > > 3) Remove ignore marks we need to remove.
> > > >
> > > > Our customer use cases may have many millions of dirs.
> > > > I don't think this solution will be scalable, which is why I use the
> > > > alternating groups to invalidate all the existing ignore marks at once.
> > >
> > > I see. Well, we could also extend FAN_MARK_FLUSH so that you can just
> > > remove all ignore marks from a group so that you don't have to remove them
> > > one-by-one and don't have to switch to a new group. In principle group
> > > teardown does the same. It would allow large scale as well as small scale
> > > users use very similar scheme with single group for switching periods.
> > >
> >
> > Maybe so, I need to try it to see if it can scale.
> >
> > But note that in the alternating group scheme we do NOT need to wait for
> > the old group teardown to complete before returning the results of the query:
> > 1. group T+1 subscribes to some events with FAN_MARK_SYNC
> > 2. group T unsubscribes from those events
> >
> > Step 2 can be done in the background.
> > The query which returns all the files modified between checkpoints T..T+1
> > can already return the changes recorded by group T while group T is
> > shutting down and cleaning up all the evictable marks.
>
> I agree. With my scheme with a single group we first need to remove the
> ignore marks before we can report events for T and start collecting events
> for T+1. Total amount of work is going to be the same but latency of
> reporting is going to be higher. I'd just think that it would not be too bad
> but it needs to be measured. Also from the scheme you've described the
> declaration of a checkpoint didn't seem like a latency sensitive operation?
>

I guess not. I will look into it.

> > > > But I agree that alternating groups should not be a requirement for HSM
> > > > and that for watching smaller subtrees, your suggestion makes more sense.
> > > >
> > > > > 4) Switch to new period & clear CHECKPOINT_PENDING, all events are now
> > > > >    recorded to the new period.
> > > > > 5) Merge all events from 'transition' stream to both old and new period
> > > > >    event streams.
> > > > > 6) Events get removed from the 'transition' stream only once we receive
> > > > >    POST_WRITE event corresponding to the PRE_WRITE event recorded there (or
> > > > >    on crash recovery). This way some events from 'transition' stream may
> > > > >    get merged to multiple period event streams if the checkpoints are
> > > > >    frequent and writes take long.
> > > > >
> > > > > This should avoid the above race, should be relatively lightweight, and
> > > > > does not require major API extensions.
> > > > >
> > > >
> > > > If I am not mistaken, CHECKPOINT_PENDING vs. alternating groups
> > > > is an implementation detail for the HSM.
> > > >
> > > > PRE_WRITE/POST_WRITE and FAN_MARK_SYNC APIs are needed
> > > > for both the implementations (single group scheme needs to flush all
> > > > ignore marks with FAN_MARK_SYNC).
> > >
> > > So why would be FAN_MARK_SYNC needed for the single group scheme? From the
> > > kernel POV the scheme I have proposed does not require any new API changes
> > > besides the POST_WRITE event AFAICT. And possibly FAN_MARK_FLUSH tweak for
> > > more efficient removal of ignore marks. We don't even need the filesystem
> > > freezing (modulo the buffered vs direct IO discussion below).
> > >
> >
> > Maybe I'm wrong, but my understanding is that after:
> > 3) Remove ignore marks we need to remove.
> > a PRE_WRITE event may still be in send_to_group()
> > with one of the "removed" ignore marks and be ignored.
> >
> > So it is not safe to:
> > 4) Switch to new period & clear CHECKPOINT_PENDING
>
> Well, but we'd still get the POST_WRITE event, evaluate this as a write
> straddling checkpoint and include the file into the set of changed files
> for checkpoint T+1 or later. So I don't think synchronize_srcu() is needed
> anywhere?
>

You are probably right. I will double check.

> > My understanding is that
> > synchronize_srcu(&fsnotify_mark_srcu);
> > is needed as barrier between 3) and 4)
> >
> > In any case, even if CHECKPOINT_PENDING can work,
> > with or without FAN_MARK_SYNC, to me personally, understanding
> > the proof of correctness of alternating groups model is very easy,
> > while proving correctness for CHECKPOINT_PENDING model is
> > something that I was not yet able to accomplish.
>
> I agree the scheme with CHECKPOINT_PENDING isn't easy to reason about but I
> don't find your scheme with two groups simpler ;) Let me try to write down
> rationale for my scheme, I think I can even somewhat simplify it:
>
> Write operation consists of:
> generate PRE_WRITE on F
> modify data of F
> generate POST_WRITE on F
>
> Checkpoint consists of:
> clear ignore marks
> report files for which we received PRE_WRITE or POST_WRITE until this
> moment
>
> And the invariant we try to provide is: If file F was modified during
> checkpoint T, then we report F as modified during T or some later
> checkpoint. Where it is matter of quality of implementation that "some
> later checkpoint" isn't too much later than T but it depends on the
> frequency of checkpoints, the length of notification queue, etc. so it is
> difficult to give hard guarantees.
>
> And the argument why the scheme maintains the invariant is that if
> POST_WRITE is generated after "clear ignore marks" finishes, it will get
> delivered and thus F will get reported as modified in some checkpoint once
> the event is processed. If POST_WRITE gets generated before "clear ignore
> marks" finishes and F is among ignored inodes, it means F is already in
> modified set so it will get reported as part of checkpoint T. Also
> application will already see modified data when processing list of modified
> files in checkpoint T.
>
> Simple and we don't even need PRE_WRITE here. But maybe you wanted to
> provide stronger invariant? Like "you are not able to see modified data
> without seeing F as modified?" But what exactly would be a guarantee here?
> I feel I'm missing something here but I cannot quite grasp it at this
> moment...
>

This is the very basic guarantee that the persistent change tracking snapshots
need to provide. If a crash happens after modification is done and before
modification is recorded, we won't detect the modification after reboot.

Maybe "checkpoint" was a bad name to use for this handover between
two subsequent change tracking snapshots.

> > > > I am going to try to implement the PRE/POST_WRITE events and for
> > > > POC I may start with a single group because it may be easier or I may
> > > > implement both schemes, we'll see.
> > >
> > > That would be great. Thank you. Perhaps I'm missing something in my mental
> > > model which would make things impractical :)
> > >
> >
> > Me too. I won't know before I try.
> >
> > FYI, at the moment I am considering not allowing independent
> > subscription for PRE/POST_XXX events, but only subscribe to
> > XXX events and a group with class FAN_CLASS_VFS_FILTER
> > will get both PRE/POST_XXX and won't be able to subscribe
> > to XXX events that do not have PRE_XXX events.
> >
> > The rationale is that if a group subscribes to either PRE/POST_XXX
> > XXX() operation is not going to be on the fast path anyway.
> >
> > This will make it easier to support more PRE/POST_XXX events
> > without using up all the remaining 32bits namespace.
> >
> > Using the high 32bits of mask for PRE events and folding them
> > in the object interest mask with the low 32bit is another thing
> > that I was considering in case you would prefer to allow
> > independent subscription for PRE/POST_XXX events.
>
> So one question: Do you see anything else than POST_WRITE as being useful?
> For directory operations it seems pointless as they hold i_rwsem exclusively
> so I don't see useful distinction between PRE and POST event. For OPEN and
> CLOSE I don't see use either. ACCESS might be the only one where PRE and
> POST would both be useful for something.
>

PRE_ACCESS is used to populate missing data and POST_ACCESS
is irrelevant for that.

PRE_MODIFY is used for something completely different, it is used
for the persistent change tracking and this has to be crash safe, so
exclusive i_rwsem has nothing to do with it.

PRE_MODIFY is called before i_rwsem and before sb_start_write()
so that HSM can record the change intent in the same filesystem
where the change is going to happen (this is a journaled record).
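
To illustrate the ordering (simplified sketch, hook name taken from my
POC and not the exact code in my branch):

	/* in vfs_write(), before any locks are taken: */
	ret = fsnotify_pre_modify(file, *pos, count);	/* may block on HSM */
	if (ret)
		return ret;

	file_start_write(file);		/* sb_start_write() */
	/* ->write_iter() takes i_rwsem and modifies the data */
	...
	file_end_write(file);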

I feel I may have not explained this correctly.
Does this make sense?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-11-14 20:08                           ` Amir Goldstein
@ 2022-11-15 10:16                             ` Jan Kara
  2022-11-15 13:08                               ` Amir Goldstein
  0 siblings, 1 reply; 43+ messages in thread
From: Jan Kara @ 2022-11-15 10:16 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Dave Chinner

On Mon 14-11-22 22:08:16, Amir Goldstein wrote:
> On Mon, Nov 14, 2022 at 9:17 PM Jan Kara <jack@suse.cz> wrote:
> > > My understanding is that
> > > synchronize_srcu(&fsnotify_mark_srcu);
> > > is needed as barrier between 3) and 4)
> > >
> > > In any case, even if CHECKPOINT_PENDING can work,
> > > with or without FAN_MARK_SYNC, to me personally, understanding
> > > the proof of correctness of alternating groups model is very easy,
> > > while proving correctness for CHECKPOINT_PENDING model is
> > > something that I was not yet able to accomplish.
> >
> > I agree the scheme with CHECKPOINT_PENDING isn't easy to reason about but I
> > don't find your scheme with two groups simpler ;) Let me try to write down
> > rationale for my scheme, I think I can even somewhat simplify it:
> >
> > Write operation consists of:
> > generate PRE_WRITE on F
> > modify data of F
> > generate POST_WRITE on F
> >
> > Checkpoint consists of:
> > clear ignore marks
> > report files for which we received PRE_WRITE or POST_WRITE until this
> > moment
> >
> > And the invariant we try to provide is: If file F was modified during
> > checkpoint T, then we report F as modified during T or some later
> > checkpoint. Where it is matter of quality of implementation that "some
> > later checkpoint" isn't too much later than T but it depends on the
> > frequency of checkpoints, the length of notification queue, etc. so it is
> > difficult to give hard guarantees.
> >
> > And the argument why the scheme maintains the invariant is that if
> > POST_WRITE is generated after "clear ignore marks" finishes, it will get
> > delivered and thus F will get reported as modified in some checkpoint once
> > the event is processed. If POST_WRITE gets generated before "clear ignore
> > marks" finishes and F is among ignored inodes, it means F is already in
> > modified set so it will get reported as part of checkpoint T. Also
> > application will already see modified data when processing list of modified
> > files in checkpoint T.
> >
> > Simple and we don't even need PRE_WRITE here. But maybe you wanted to
> > provide stronger invariant? Like "you are not able to see modified data
> > without seeing F as modified?" But what exactly would be a guarantee here?
> > I feel I'm missing something here but I cannot quite grasp it at this
> > moment...
> >
> 
> This is the very basic guarantee that the persistent change tracking snapshots
> need to provide. If a crash happens after modification is done and before
> modification is recorded, we won't detect the modification after reboot.

Right, crash safety was the point I was missing ;) Thanks for reminding me.
And now I also see why you use filesystem freezing - it is a way to make
things crash safe as otherwise it is difficult to guard against a race

generate PRE_WRITE for F
				PRE_WRITE ignored because file is already
					modified
				checkpoint happens -> F reported as modified
				  contents of F fetched

modify data
transaction commit
<crash>
				POST_WRITE never seen so change to F is
				  never reported

I just think filesystem freezing is too big a hammer for widespread use of
persistent change tracking. Can't we introduce some SRCU lock / unlock into
file_start_write() / file_end_write() and then invoke synchronize_srcu()
during checkpoint after removing ignore marks? It will be much cheaper as
we don't have to flush all dirty data to disk and also writes can keep
flowing while we wait for outstanding writes straddling checkpoint to
complete. What do you think?

The checkpoint would then do:
start gathering changes for both T and T+1
clear ignore marks
synchronize_srcu()
stop gathering changes for T and report them

And in this case we would not need POST_WRITE as an event.
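
I mean something like this (only a sketch, with a single global SRCU
for now):

	DEFINE_SRCU(write_srcu);

	/* file_start_write() / file_end_write() additionally do: */
	idx = srcu_read_lock(&write_srcu);
	...
	srcu_read_unlock(&write_srcu, idx);

	/* and the checkpoint, after clearing the ignore marks, does: */
	synchronize_srcu(&write_srcu);
	/* => any write that could have been ignored by the old marks has
	 *    completed before we report the modified files for T */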

The technical problem I see is how to deal with AIO / io_uring because
SRCU needs to be released in the same context as it is acquired - that
would need to be consulted with Paul McKenney if we can make it work. And
another problem I see is that it might not be great to have this
system-wide as e.g. on networking filesystems or pipes writes can block for
really long.

Final question is how to expose this to userspace because this
functionality would seem useful outside of filesystem notification space so
probably do not need to tie it to that.

Or we could simplify our life somewhat and acquire SRCU when generating
PRE_WRITE and drop it when generating POST_WRITE. This would keep SRCU
within fsnotify and would mitigate the problems coming from system-wide
SRCU. OTOH it will create problems when PRE_WRITE gets generated and
POST_WRITE would not for some reason. Just brainstorming here, I've not
really decided what's better...

> Maybe "checkpoint" was a bad name to use for this handover between
> two subsequent change tracking snapshots.

I'm getting used to the terminology :) But to me it still seems more
natural to look at the situation as a single stream of events where we fetch
bulks of changes at certain moments, rather than looking at it as certain
entities collecting events for different time intervals.

> > > > > I am going to try to implement the PRE/POST_WRITE events and for
> > > > > POC I may start with a single group because it may be easier or I may
> > > > > implement both schemes, we'll see.
> > > >
> > > > That would be great. Thank you. Perhaps I'm missing something in my mental
> > > > model which would make things impractical :)
> > > >
> > >
> > > Me too. I won't know before I try.
> > >
> > > FYI, at the moment I am considering not allowing independent
> > > subscription for PRE/POST_XXX events, but only subscribe to
> > > XXX events and a group with class FAN_CLASS_VFS_FILTER
> > > will get both PRE/POST_XXX and won't be able to subscribe
> > > to XXX events that do not have PRE_XXX events.
> > >
> > > The rationale is that if a group subscribes to either PRE/POST_XXX
> > > XXX() operation is not going to be on the fast path anyway.
> > >
> > > This will make it easier to support more PRE/POST_XXX events
> > > without using up all the remaining 32bits namespace.
> > >
> > > Using the high 32bits of mask for PRE events and folding them
> > > in the object interest mask with the low 32bit is another thing
> > > that I was considering in case you would prefer to allow
> > > independent subscription for PRE/POST_XXX events.
> >
> > So one question: Do you see anything else than POST_WRITE as being useful?
> > For directory operations it seems pointless as they hold i_rwsem exclusively
> > so I don't see useful distinction between PRE and POST event. For OPEN and
> > CLOSE I don't see use either. ACCESS might be the only one where PRE and
> > POST would both be useful for something.
> 
> PRE_ACCESS is used to populate missing data and POST_ACCESS
> is irrelevant for that.
> 
> PRE_MODIFY is used for something completely different, it is used
> for the persistent change tracking and this has to be crash safe, so
> exclusive i_rwsem has nothing to do with it.
> 
> PRE_MODIFY is called before i_rwsem and before sb_start_write()
> so that HSM can record the change intent in the same filesystem
> where the change is going to happen (this is a journaled record).
> 
> I feel I may have not explained this correctly.
> Does this make sense?

Yes, makes sense.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-11-15 10:16                             ` Jan Kara
@ 2022-11-15 13:08                               ` Amir Goldstein
  2022-11-16 10:56                                 ` Jan Kara
  0 siblings, 1 reply; 43+ messages in thread
From: Amir Goldstein @ 2022-11-15 13:08 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner

On Tue, Nov 15, 2022 at 12:16 PM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 14-11-22 22:08:16, Amir Goldstein wrote:
> > On Mon, Nov 14, 2022 at 9:17 PM Jan Kara <jack@suse.cz> wrote:
> > > > My understanding is that
> > > > synchronize_srcu(&fsnotify_mark_srcu);
> > > > is needed as barrier between 3) and 4)
> > > >
> > > > In any case, even if CHECKPOINT_PENDING can work,
> > > > with or without FAN_MARK_SYNC, to me personally, understanding
> > > > the proof of correctness of alternating groups model is very easy,
> > > > while proving correctness for CHECKPOINT_PENDING model is
> > > > something that I was not yet able to accomplish.
> > >
> > > I agree the scheme with CHECKPOINT_PENDING isn't easy to reason about but I
> > > don't find your scheme with two groups simpler ;) Let me try to write down
> > > rationale for my scheme, I think I can even somewhat simplify it:
> > >
> > > Write operation consists of:
> > > generate PRE_WRITE on F
> > > modify data of F
> > > generate POST_WRITE on F
> > >
> > > Checkpoint consists of:
> > > clear ignore marks
> > > report files for which we received PRE_WRITE or POST_WRITE until this
> > > moment
> > >
> > > And the invariant we try to provide is: If file F was modified during
> > > checkpoint T, then we report F as modified during T or some later
> > > checkpoint. Where it is matter of quality of implementation that "some
> > > later checkpoint" isn't too much later than T but it depends on the
> > > frequency of checkpoints, the length of notification queue, etc. so it is
> > > difficult to give hard guarantees.
> > >
> > > And the argument why the scheme maintains the invariant is that if
> > > POST_WRITE is generated after "clear ignore marks" finishes, it will get
> > > delivered and thus F will get reported as modified in some checkpoint once
> > > the event is processed. If POST_WRITE gets generated before "clear ignore
> > > marks" finishes and F is among ignored inodes, it means F is already in
> > > modified set so it will get reported as part of checkpoint T. Also
> > > application will already see modified data when processing list of modified
> > > files in checkpoint T.
> > >
> > > Simple and we don't even need PRE_WRITE here. But maybe you wanted to
> > > provide stronger invariant? Like "you are not able to see modified data
> > > without seeing F as modified?" But what exactly would be a guarantee here?
> > > I feel I'm missing something here but I cannot quite grasp it at this
> > > moment...
> > >
> >
> > This is the very basic guarantee that the persistent change tracking snapshots
> > need to provide. If a crash happens after modification is done and before
> > modification is recorded, we won't detect the modification after reboot.
>
> Right, crash safety was the point I was missing ;) Thanks for reminding me.
> And now I also see why you use filesystem freezing - it is a way to make
> things crash safe as otherwise it is difficult to guard against a race
>
> generate PRE_WRITE for F
>                                 PRE_WRITE ignored because file is already
>                                         modified
>                                 checkpoint happens -> F reported as modified
>                                   contents of F fetched
>
> modify data
> transaction commit
> <crash>
>                                 POST_WRITE never seen so change to F is
>                                   never reported
>
> I just think filesystem freezing is too big a hammer for widespread use of
> persistent change tracking.

Note that fsfreeze is also needed to flush dirty data after modify data,
not only to wait for modify data transaction commit.

Otherwise the fetched contents of F will differ from contents of F
after reboot even if we did wait for POST_WRITE.

However, in this case, file contents can be considered corrupted
and rsync, for example, will not detect the change either, because
mtime does match the previously fetched value.

As long as applications write files safely (with rename) fsfreeze is not
needed, but for strict change tracking, fsfreeze is needed, so fsfreeze
is a policy decision of HSM.

> Can't we introduce some SRCU lock / unlock into
> file_start_write() / file_end_write() and then invoke synchronize_srcu()
> during checkpoint after removing ignore marks? It will be much cheaper as
> we don't have to flush all dirty data to disk and also writes can keep
> flowing while we wait for outstanding writes straddling checkpoint to
> complete. What do you think?

Maybe, but this is not enough.
Note that my patches [1] are overlapping fsnotify_mark_srcu with
file_start_write(), so we would need to overlay fsnotify_mark_srcu
with this new "modify SRCU".

[1] https://github.com/amir73il/linux/commits/fanotify_pre_content

>
> The checkpoint would then do:
> start gathering changes for both T and T+1
> clear ignore marks
> synchronize_srcu()
> stop gathering changes for T and report them
>
> And in this case we would not need POST_WRITE as an event.
>

Why then give up on the POST_WRITE events idea?
Don't you think it could work?
Or is it just because you think the generic API would be useful to others?

> The technical problem I see is how to deal with AIO / io_uring because
> SRCU needs to be released in the same context as it is acquired - that
> would need to be consulted with Paul McKenney if we can make it work. And
> another problem I see is that it might not be great to have this
> system-wide as e.g. on networking filesystems or pipes writes can block for
> really long.
>
> Final question is how to expose this to userspace because this
> functionality would seem useful outside of filesystem notification space so
> probably do not need to tie it to that.
>
> Or we could simplify our life somewhat and acquire SRCU when generating
> PRE_WRITE and drop it when generating POST_WRITE. This would keep SRCU
> within fsnotify and would mitigate the problems coming from system-wide
> SRCU. OTOH it will create problems when PRE_WRITE gets generated and
> POST_WRITE would not for some reason. Just branstorming here, I've not
> really decided what's better...
>

What if checkpoint only acquired (and released) exclusive sb_writers without
flushing dirty data.
Wouldn't that be equivalent to the synchronize_srcu() you suggested?

> > Maybe "checkpoint" was a bad name to use for this handover between
> > two subsequent change tracking snapshots.
>
> I'm getting used to the terminology :) But to me it still seems more
> natural to look at the situation as a single stream of events where we fetch
> bulks of changes at certain moments, rather than looking at it as certain
> entities collecting events for different time intervals.
>

I always used "snapshot take" as terminology.
I just now started to use "checkpoint" for this userspace HSM implementation.

I have no objection to single stream, nor to "flush all evictable" command.
I will try to start with this implementation for POC.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-11-15 13:08                               ` Amir Goldstein
@ 2022-11-16 10:56                                 ` Jan Kara
  2022-11-16 16:24                                   ` Amir Goldstein
  0 siblings, 1 reply; 43+ messages in thread
From: Jan Kara @ 2022-11-16 10:56 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Dave Chinner

On Tue 15-11-22 15:08:17, Amir Goldstein wrote:
> On Tue, Nov 15, 2022 at 12:16 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Mon 14-11-22 22:08:16, Amir Goldstein wrote:
> > > On Mon, Nov 14, 2022 at 9:17 PM Jan Kara <jack@suse.cz> wrote:
> > > > > My understanding is that
> > > > > synchronize_srcu(&fsnotify_mark_srcu);
> > > > > is needed as barrier between 3) and 4)
> > > > >
> > > > > In any case, even if CHECKPOINT_PENDING can work,
> > > > > with or without FAN_MARK_SYNC, to me personally, understanding
> > > > > the proof of correctness of alternating groups model is very easy,
> > > > > while proving correctness for CHECKPOINT_PENDING model is
> > > > > something that I was not yet able to accomplish.
> > > >
> > > > I agree the scheme with CHECKPOINT_PENDING isn't easy to reason about but I
> > > > don't find your scheme with two groups simpler ;) Let me try to write down
> > > > rationale for my scheme, I think I can even somewhat simplify it:
> > > >
> > > > Write operation consists of:
> > > > generate PRE_WRITE on F
> > > > modify data of F
> > > > generate POST_WRITE on F
> > > >
> > > > Checkpoint consists of:
> > > > clear ignore marks
> > > > report files for which we received PRE_WRITE or POST_WRITE until this
> > > > moment
> > > >
> > > > And the invariant we try to provide is: If file F was modified during
> > > > checkpoint T, then we report F as modified during T or some later
> > > > checkpoint. Where it is matter of quality of implementation that "some
> > > > later checkpoint" isn't too much later than T but it depends on the
> > > > frequency of checkpoints, the length of notification queue, etc. so it is
> > > > difficult to give hard guarantees.
> > > >
> > > > And the argument why the scheme maintains the invariant is that if
> > > > POST_WRITE is generated after "clear ignore marks" finishes, it will get
> > > > delivered and thus F will get reported as modified in some checkpoint once
> > > > the event is processed. If POST_WRITE gets generated before "clear ignore
> > > > marks" finishes and F is among ignored inodes, it means F is already in
> > > > modified set so it will get reported as part of checkpoint T. Also
> > > > application will already see modified data when processing list of modified
> > > > files in checkpoint T.
> > > >
> > > > Simple and we don't even need PRE_WRITE here. But maybe you wanted to
> > > > provide stronger invariant? Like "you are not able to see modified data
> > > > without seeing F as modified?" But what exactly would be a guarantee here?
> > > > I feel I'm missing something here but I cannot quite grasp it at this
> > > > moment...
> > > >
> > >
> > > This is the very basic guarantee that the persistent change tracking snapshots
> > > need to provide. If a crash happens after modification is done and before
> > > modification is recorded, we won't detect the modification after reboot.
> >
> > Right, crash safety was the point I was missing ;) Thanks for reminding me.
> > And now I also see why you use filesystem freezing - it is a way to make
> > things crash safe as otherwise it is difficult to guard against a race
> >
> > generate PRE_WRITE for F
> >                                 PRE_WRITE ignored because file is already
> >                                         modified
> >                                 checkpoint happens -> F reported as modified
> >                                   contents of F fetched
> >
> > modify data
> > transaction commit
> > <crash>
> >                                 POST_WRITE never seen so change to F is
> >                                   never reported
> >
> > I just think filesystem freezing is too big a hammer for widespread use of
> > persistent change tracking.
> 
> Note that fsfreeze is also needed to flush dirty data after modify data,
> not only to wait for modify data transaction commit.
> 
> Otherwise the fetched contents of F will differ from contents of F
> after reboot even if we did wait for POST_WRITE.
> 
> However, in this case, file contents can be considered corrupted
> and rsync, for example, will not detect the change either, because
> mtime does match the previously fetched value.
> 
> As long as applications write files safely (with rename) fsfreeze is not
> needed, but for strict change tracking, fsfreeze is needed, so fsfreeze
> is a policy decision of HSM.

Yeah, I agree. It is a policy decision.

> > Can't we introduce some SRCU lock / unlock into
> > file_start_write() / file_end_write() and then invoke synchronize_srcu()
> > during checkpoint after removing ignore marks? It will be much cheaper as
> > we don't have to flush all dirty data to disk and also writes can keep
> > flowing while we wait for outstanding writes straddling checkpoint to
> > complete. What do you think?
> 
> Maybe, but this is not enough.
> Note that my patches [1] are overlapping fsnotify_mark_srcu with
> file_start_write(), so we would need to overlay fsnotify_mark_srcu
> with this new "modify SRCU".
> 
> [1] https://github.com/amir73il/linux/commits/fanotify_pre_content

Yes, I know that and frankly, that is what I find somewhat ugly :) I'd rather
have the "modify SRCU" cover the whole region we need - which means
including the generation of PRE_MODIFY event.

> > The checkpoint would then do:
> > start gathering changes for both T and T+1
> > clear ignore marks
> > synchronize_srcu()
> > stop gathering changes for T and report them
> >
> > And in this case we would not need POST_WRITE as an event.
> >
> 
> Why then give up on the POST_WRITE events idea?
> Don't you think it could work?

So as we are discussing, the POST_WRITE event is not useful when we want to
handle crash safety. And if we have some other mechanism (like SRCU) which
is able to guarantee crash safety, then what is the benefit of POST_WRITE?
I'm not against POST_WRITE, I just don't see much value in it if we have
another mechanism to deal with events straddling checkpoint.

> > The technical problem I see is how to deal with AIO / io_uring because
> > SRCU needs to be released in the same context as it is acquired - that
> > would need to be consulted with Paul McKenney if we can make it work. And
> > another problem I see is that it might not be great to have this
> > system-wide as e.g. on networking filesystems or pipes writes can block for
> > really long.
> >
> > Final question is how to expose this to userspace because this
> > functionality would seem useful outside of filesystem notification space so
> > probably do not need to tie it to that.
> >
> > Or we could simplify our life somewhat and acquire SRCU when generating
> > PRE_WRITE and drop it when generating POST_WRITE. This would keep SRCU
> > within fsnotify and would mitigate the problems coming from system-wide
> > SRCU. OTOH it will create problems when PRE_WRITE gets generated and
> > POST_WRITE would not for some reason. Just brainstorming here, I've not
> > really decided what's better...
> >
> 
> What if checkpoint only acquired (and released) exclusive sb_writers without
> flushing dirty data.
> Wouldn't that be equivalent to the synchronize_srcu() you suggested?

In terms of guarantees it would be equivalent. In terms of impact on the
system it will be considerably worse. Because SRCU allows new SRCU readers
to start while synchronize_srcu() is running - so in our case new writes
can freely run while we are waiting for pending writes to complete. So
impact of the synchronize_srcu() on system activity will be practically
unnoticeable. If we use sb_writers as you suggest, it will block all write
activity until all writes finish. Which can be significant amount of time
if you have e.g. write(1 GB of data) running.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-11-16 10:56                                 ` Jan Kara
@ 2022-11-16 16:24                                   ` Amir Goldstein
  2022-11-17 12:38                                     ` Amir Goldstein
                                                       ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Amir Goldstein @ 2022-11-16 16:24 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner

> > > Can't we introduce some SRCU lock / unlock into
> > > file_start_write() / file_end_write() and then invoke synchronize_srcu()
> > > during checkpoint after removing ignore marks? It will be much cheaper as
> > > we don't have to flush all dirty data to disk and also writes can keep
> > > flowing while we wait for outstanding writes straddling checkpoint to
> > > complete. What do you think?
> >
> > Maybe, but this is not enough.
> > Note that my patches [1] are overlapping fsnotify_mark_srcu with
> > file_start_write(), so we would need to overlay fsnotify_mark_srcu
> > with this new "modify SRCU".
> >
> > [1] https://github.com/amir73il/linux/commits/fanotify_pre_content
>
> Yes, I know that and frankly, that is what I find somewhat ugly :) I'd rather
> have the "modify SRCU" cover the whole region we need - which means
> including the generation of PRE_MODIFY event.
>

Yeah, it would be great if we can pull this off.

> > > The checkpoint would then do:
> > > start gathering changes for both T and T+1
> > > clear ignore marks
> > > synchronize_srcu()
> > > stop gathering changes for T and report them
> > >
> > > And in this case we would not need POST_WRITE as an event.
> > >
> >
> > Why then give up on the POST_WRITE events idea?
> > Don't you think it could work?
>
> So as we are discussing, the POST_WRITE event is not useful when we want to
> handle crash safety. And if we have some other mechanism (like SRCU) which
> is able to guarantee crash safety, then what is the benefit of POST_WRITE?
> I'm not against POST_WRITE, I just don't see much value in it if we have
> another mechanism to deal with events straddling checkpoint.
>

Not sure I follow.

I think that crash safety can be achieved also with PRE/POST_WRITE:
- PRE_WRITE records an intent to write in persistent snapshot T
  and add to in-memory map of in-progress writes of period T
- When "checkpoint T" starts, new PRE_WRITES are recorded in both
  T and T+1 persistent snapshots, but event is added only to
  in-memory map of in-progress writes of period T+1
- "checkpoint T" ends when all in-progress writes of T are completed

The trick with alternating snapshots "handover" is this
(perhaps I never explained it and I need to elaborate on the wiki [1]):

[1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#Modified_files_query

The changed files query results need to include recorded changes in both
"finalizing" snapshot T and the new snapshot T+1 that was started in
the beginning of the query.

Snapshot T MUST NOT be discarded until checkpoint/handover
is complete AND the query results that contain changes recorded
in T and T+1 snapshots have been consumed.

When the consumer ACKs that the query results have been safely stored
or acted upon (I called this operation "bless" snapshot T+1) then and
only then can snapshot T be discarded.

After snapshot T is discarded a new query will start snapshot T+2.
A changed files query result includes the id of the last blessed snapshot.

I think this is more or less equivalent to the SRCU that you suggested,
but all the work is done in userspace at application level.
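
In pseudo code (all helper names made up):

	query_changes()			/* the modified files query */
	{
		start_snapshot(T + 1);	 /* new PRE_WRITEs recorded in both T and T+1 */
		wait_inflight_writes(T); /* every PRE_WRITE of T got its POST_WRITE */
		report(changes(T) + changes(T + 1));
	}

	bless(T + 1)			/* consumer ACKed the query results */
	{
		discard_snapshot(T);	/* only now may T be discarded;
					 * the next query starts snapshot T + 2 */
	}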

If you see any problem with this scheme or don't understand it
please let me know and I will try to explain better.


> > > The technical problem I see is how to deal with AIO / io_uring because
> > > SRCU needs to be released in the same context as it is acquired - that
> > > would need to be consulted with Paul McKenney if we can make it work. And
> > > another problem I see is that it might not be great to have this
> > > system-wide as e.g. on networking filesystems or pipes writes can block for
> > > really long.
> > >
> > > Final question is how to expose this to userspace because this
> > > functionality would seem useful outside of filesystem notification space so
> > > probably do not need to tie it to that.
> > >
> > > Or we could simplify our life somewhat and acquire SRCU when generating
> > > PRE_WRITE and drop it when generating POST_WRITE. This would keep SRCU
> > > within fsnotify and would mitigate the problems coming from system-wide
> > > SRCU. OTOH it will create problems when PRE_WRITE gets generated and
> > > > POST_WRITE would not for some reason. Just brainstorming here, I've not
> > > really decided what's better...
> > >

Seems there are several non-trivial challenges to surmount with this
"userspace modification SRCU" idea.

For now, I will stay in my comfort zone and try to make the POC
with PRE/POST_WRITE work and write the proof of correctness.

I will have no objection at all if you figure out how to solve those
issues and guide me to a path for implementing sb_write_srcu.
It will make the userspace implementation much simpler, getting rid
of the in-progress writes in-memory tracking.

> >
> > What if checkpoint only acquired (and released) exclusive sb_writers without
> > flushing dirty data.
> > Wouldn't that be equivalent to the synchronize_srcu() you suggested?
>
> In terms of guarantees it would be equivalent. In terms of impact on the
> system it will be considerably worse. Because SRCU allows new SRCU readers
> to start while synchronize_srcu() is running - so in our case new writes
> can freely run while we are waiting for pending writes to complete. So
> impact of the synchronize_srcu() on system activity will be practically
> unnoticeable. If we use sb_writers as you suggest, it will block all write
> activity until all writes finish. Which can be significant amount of time
> if you have e.g. write(1 GB of data) running.
>

Of course, it was a silly idea.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-11-16 16:24                                   ` Amir Goldstein
@ 2022-11-17 12:38                                     ` Amir Goldstein
  2022-11-23 10:49                                       ` Jan Kara
  2022-11-21 16:40                                     ` Amir Goldstein
  2022-11-23 10:10                                     ` Jan Kara
  2 siblings, 1 reply; 43+ messages in thread
From: Amir Goldstein @ 2022-11-17 12:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner

> > > > The checkpoint would then do:
> > > > start gathering changes for both T and T+1
> > > > clear ignore marks
> > > > synchronize_srcu()
> > > > stop gathering changes for T and report them
> > > >
> > > > And in this case we would not need POST_WRITE as an event.
> > > >
> > >
> > > Why then give up on the POST_WRITE events idea?
> > > Don't you think it could work?
> >
> > So as we are discussing, the POST_WRITE event is not useful when we want to
> > handle crash safety. And if we have some other mechanism (like SRCU) which
> > is able to guarantee crash safety, then what is the benefit of POST_WRITE?
> > I'm not against POST_WRITE, I just don't see much value in it if we have
> > another mechanism to deal with events straddling checkpoint.
> >
>
> Not sure I follow.
>
> I think that crash safety can be achieved also with PRE/POST_WRITE:
> - PRE_WRITE records an intent to write in persistent snapshot T
>   and add to in-memory map of in-progress writes of period T
> - When "checkpoint T" starts, new PRE_WRITES are recorded in both
>   T and T+1 persistent snapshots, but event is added only to
>   in-memory map of in-progress writes of period T+1
> - "checkpoint T" ends when all in-progress writes of T are completed
>
> The trick with alternating snapshots "handover" is this
> (perhaps I never explained it and I need to elaborate on the wiki [1]):
>
> [1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#Modified_files_query
>
> The changed files query results need to include recorded changes in both
> "finalizing" snapshot T and the new snapshot T+1 that was started in
> the beginning of the query.
>
> Snapshot T MUST NOT be discarded until checkpoint/handover
> is complete AND the query results that contain changes recorded
> in T and T+1 snapshots have been consumed.
>
> When the consumer ACKs that the query results have been safely stored
> or acted upon (I called this operation "bless" snapshot T+1) then and
> only then can snapshot T be discarded.
>
> After snapshot T is discarded a new query will start snapshot T+2.
> A changed files query result includes the id of the last blessed snapshot.
>
> I think this is more or less equivalent to the SRCU that you suggested,
> but all the work is done in userspace at application level.
>
> If you see any problem with this scheme or don't understand it
> please let me know and I will try to explain better.
>

Hmm, I guess "crash safety" is not well defined.
You and I were talking about "system crash" and indeed, this was
my only concern with the kernel implementation of the overlayfs watch.

But with a userspace HSM service, how can we guarantee that
modifications did not happen while the service was down?

I don't really have a good answer for this.

Thinking out loud, we would somehow need to make "deny" the default
permission decision for all modifications, maybe through some mount
property (e.g. MOUNT_ATTR_PROT_READ), causing the pre-write
hooks to default to EROFS if there is no "vfs filter" mount mark.

Then it will be possible to expose a "safe" mount to users, where
modifications can never go unnoticed even when HSM service
crashes.
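
E.g. something like this in the pre-write hook (completely hypothetical,
neither the mount flag nor the helper exists):

	if ((mnt->mnt_flags & MNT_PROT_READ) &&
	    !fsnotify_mnt_has_vfs_filter_mark(mnt))
		return -EROFS;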

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-11-16 16:24                                   ` Amir Goldstein
  2022-11-17 12:38                                     ` Amir Goldstein
@ 2022-11-21 16:40                                     ` Amir Goldstein
  2022-11-23 12:11                                       ` Jan Kara
  2022-11-23 10:10                                     ` Jan Kara
  2 siblings, 1 reply; 43+ messages in thread
From: Amir Goldstein @ 2022-11-21 16:40 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner

On Wed, Nov 16, 2022 at 6:24 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> > > > Can't we introduce some SRCU lock / unlock into
> > > > file_start_write() / file_end_write() and then invoke synchronize_srcu()
> > > > during checkpoint after removing ignore marks? It will be much cheaper as
> > > > we don't have to flush all dirty data to disk and also writes can keep
> > > > flowing while we wait for outstanding writes straddling checkpoint to
> > > > complete. What do you think?
> > >
> > > Maybe, but this is not enough.
> > > Note that my patches [1] are overlapping fsnotify_mark_srcu with
> > > file_start_write(), so we would need to overlay fsnotify_mark_srcu
> > > with this new "modify SRCU".
> > >
> > > [1] https://github.com/amir73il/linux/commits/fanotify_pre_content
> >
> > Yes, I know that and frankly, that is what I find somewhat ugly :) I'd rather
> > have the "modify SRCU" cover the whole region we need - which means
> > including the generation of PRE_MODIFY event.
> >
>
> Yeah, it would be great if we can pull this off.

OK. I decided to give this a shot, see:

https://github.com/amir73il/linux/commits/sb_write_barrier

It is just a sketch to show the idea, very lightly tested.
What I did is: instead of converting all the sb_start,end_write()
call sites, which would be a huge change, only callers of the new
sb_start,end_write_srcu() helpers participate in the "modify SRCU".

I then converted all the dir modify call sites and some other
call sites to use helpers that take SRCU and call pre-modify hooks.
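
The helpers look roughly like this (simplified sketch, hook name made
up, error handling omitted; the branch is the authoritative version):

	static inline int sb_start_write_srcu(struct super_block *sb,
					      const struct path *path)
	{
		int idx = srcu_read_lock(&sb->s_write_srcu);

		/* blocking pre-modify hook, before sb_start_write() */
		fsnotify_pre_modify_path(path);
		sb_start_write(sb);
		return idx;
	}

	static inline void sb_end_write_srcu(struct super_block *sb, int idx)
	{
		sb_end_write(sb);
		srcu_read_unlock(&sb->s_write_srcu, idx);
	}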

[...]

> > > > The technical problem I see is how to deal with AIO / io_uring because
> > > > SRCU needs to be released in the same context as it is acquired - that
> > > > would need to be consulted with Paul McKenney if we can make it work. And
> > > > another problem I see is that it might not be great to have this
> > > > system-wide as e.g. on networking filesystems or pipes writes can block for
> > > > really long.
> > > >

I sidestepped this problem for now - file data writes are not covered by
s_write_srcu in my POC branch.

The rationale is that for file data writes, HSM would anyway need to
use fsfreeze() to get any guarantee, so maybe s_write_srcu is not really
useful here??

It might be useful to use s_write_srcu to cover the pre-modify event
up to after sb_start_write() in file write/aio/io_uring, but not beyond it,
so that sb_write_barrier()+fsfreeze() will provide full coverage for
in-progress writes.
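
I.e. roughly this in the file write path (sketch, hook name made up):

	idx = srcu_read_lock(&sb->s_write_srcu);
	fsnotify_pre_modify(file, pos, count);		/* pre-modify event */
	sb_start_write(sb);
	srcu_read_unlock(&sb->s_write_srcu, idx);	/* not held across the write */
	/* the write itself runs under sb_writers, so it is covered by
	 * sb_write_barrier() + fsfreeze() */
	...
	sb_end_write(sb);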

Please let me know if this plan sounds reasonable.

> > > > Final question is how to expose this to userspace because this
> > > > functionality would seem useful outside of filesystem notification space so
> > > > probably do not need to tie it to that.

In the current POC branch, nothing calls sb_write_barrier() yet.
I was thinking of using it when flushing marks, maybe with the
FAN_MARK_SYNC flag that I proposed.

For a general purpose API, the semantics would need to be better
defined, as with this "opt-in" implementation, only some of the
modification operations are covered by the 'sb write barrier'.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-11-16 16:24                                   ` Amir Goldstein
  2022-11-17 12:38                                     ` Amir Goldstein
  2022-11-21 16:40                                     ` Amir Goldstein
@ 2022-11-23 10:10                                     ` Jan Kara
  2022-11-23 15:16                                       ` Amir Goldstein
  2 siblings, 1 reply; 43+ messages in thread
From: Jan Kara @ 2022-11-23 10:10 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Dave Chinner

On Wed 16-11-22 18:24:06, Amir Goldstein wrote:
> > > Why then give up on the POST_WRITE events idea?
> > > Don't you think it could work?
> >
> > So as we are discussing, the POST_WRITE event is not useful when we want to
> > handle crash safety. And if we have some other mechanism (like SRCU) which
> > is able to guarantee crash safety, then what is the benefit of POST_WRITE?
> > I'm not against POST_WRITE, I just don't see much value in it if we have
> > another mechanism to deal with events straddling checkpoint.
> >
> 
> Not sure I follow.
> 
> I think that crash safety can be achieved also with PRE/POST_WRITE:
> - PRE_WRITE records an intent to write in persistent snapshot T
>   and add to in-memory map of in-progress writes of period T
> - When "checkpoint T" starts, new PRE_WRITES are recorded in both
>   T and T+1 persistent snapshots, but event is added only to
>   in-memory map of in-progress writes of period T+1
> - "checkpoint T" ends when all in-progress writes of T are completed

So maybe I miss something, but suppose the situation I was mentioning a few
emails earlier:

PRE_WRITE for F			-> F recorded as modified in T
modify F
POST_WRITE for F

PRE_WRITE for F			-> ignored because F is already marked as
				   modified

				-> checkpoint T requested, modified files
				   reported, process modified files
modify F
--------- crash

Now unless filesystem freeze or SRCU is part of checkpoint, we will never
notify about the last modification to F. So I don't see how PRE +
POST_WRITE alone can achieve crash safety...

And if we use filesystem freeze or SRCU as part of checkpoint, then
processing of POST_WRITE events does not give us anything new. E.g.
synchronize_srcu() during checkpoint before handing out the list of modified
files makes sure all modifications to files for which PRE_MODIFY events
were generated (and thus are listed as modified in checkpoint T) are
visible for userspace.

So am I missing some case where POST_WRITE would be more useful than SRCU?
Because at this point I'd rather implement SRCU than POST_WRITE.
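
In other words, the checkpoint I have in mind is something like this
(all the changelog_*() helpers and the T / T_PLUS_1 tokens are invented,
just to spell out the order of the steps):

/* After synchronize_srcu() returns, every write that took the SRCU
 * read lock before this point - i.e. every write whose PRE_MODIFY
 * event could have landed only in checkpoint T - has completed, so
 * T's list of modified files can be handed out safely.
 */
static void changelog_checkpoint(struct super_block *sb)
{
        changelog_record_into(sb, T, T_PLUS_1); /* gather changes for both T and T+1 */
        changelog_clear_ignore_marks(sb);       /* re-arm PRE_MODIFY events */
        synchronize_srcu(&sb->s_write_srcu);    /* wait for writes straddling the checkpoint */
        changelog_report_modified(sb, T);       /* report T, stop gathering into it */
}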

> The trick with alternating snapshots "handover" is this
> (perhaps I never explained it and I need to elaborate on the wiki [1]):
> 
> [1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#Modified_files_query
> 
> The changed files query results need to include recorded changes in both
> "finalizing" snapshot T and the new snapshot T+1 that was started in
> the beginning of the query.
> 
> Snapshot T MUST NOT be discarded until checkpoint/handover
> is complete AND the query results that contain changes recorded
> in T and T+1 snapshots have been consumed.
> 
> When the consumer ACKs that the query results have been safely stored
> or acted upon (I called this operation "bless" snapshot T+1) then and
> only then can snapshot T be discarded.
> 
> After snapshot T is discarded a new query will start snapshot T+2.
> A changed files query result includes the id of the last blessed snapshot.
> 
> I think this is more or less equivalent to the SRCU that you suggested,
> but all the work is done in userspace at application level.
> 
> If you see any problem with this scheme or don't understand it
> please let me know and I will try to explain better.

So until now I was imagining that query results will be returned like one
big memcpy. I.e. a one-off event where the "persistent log daemon" hands over
the whole contents of checkpoint T to the client. Whatever happens with the
returned data is the business of the client, whatever happens with the
checkpoint T records in the daemon is the daemon's business. The model you
seem to speak about here is somewhat different - more like readdir() kind
of approach where client asks for access to checkpoint T data, daemon
provides the data record by record (probably serving the data from its
files on disk), and when the client is done and "closes" checkpoint T,
daemon's records about checkpoint T can be erased. Am I getting it right?

This however seems somewhat orthogonal to the SRCU idea. SRCU essentially
serves only one purpose - to make sure that modifications to all files for
which we have received a PRE_WRITE event are visible in the respective files.

> > > > The technical problem I see is how to deal with AIO / io_uring because
> > > > SRCU needs to be released in the same context as it is acquired - that
> > > > would need to be consulted with Paul McKenney if we can make it work. And
> > > > another problem I see is that it might not be great to have this
> > > > system-wide as e.g. on networking filesystems or pipes writes can block for
> > > > really long.
> > > >
> > > > Final question is how to expose this to userspace because this
> > > > functionality would seem useful outside of filesystem notification space so
> > > > probably do not need to tie it to that.
> > > >
> > > > Or we could simplify our life somewhat and acquire SRCU when generating
> > > > PRE_WRITE and drop it when generating POST_WRITE. This would keep SRCU
> > > > within fsnotify and would mitigate the problems coming from system-wide
> > > > SRCU. OTOH it will create problems when PRE_WRITE gets generated and
> > > > POST_WRITE would not for some reason. Just brainstorming here, I've not
> > > > really decided what's better...
> 
> Seems there are several non trivial challenges to surmount with this
> "userspace modification SRCU" idea.
> 
> For now, I will stay in my comfort zone and try to make the POC
> with PRE/POST_WRITE work and write the proof of correctness.
> 
> I will have no objection at all if you figure out how to solve those
> issues and guide me to a path for implementing sb_write_srcu.
> It will make the userspace implementation much simpler, getting rid
> of the in-progress writes in-memory tracking.

It seems you have progressed on this front yourself so let's continue there
:).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-11-17 12:38                                     ` Amir Goldstein
@ 2022-11-23 10:49                                       ` Jan Kara
  2022-11-23 13:07                                         ` Amir Goldstein
  0 siblings, 1 reply; 43+ messages in thread
From: Jan Kara @ 2022-11-23 10:49 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Dave Chinner

On Thu 17-11-22 14:38:51, Amir Goldstein wrote:
> > > > > The checkpoint would then do:
> > > > > start gathering changes for both T and T+1
> > > > > clear ignore marks
> > > > > synchronize_srcu()
> > > > > stop gathering changes for T and report them
> > > > >
> > > > > And in this case we would not need POST_WRITE as an event.
> > > > >
> > > >
> > > > Why then give up on the POST_WRITE events idea?
> > > > Don't you think it could work?
> > >
> > > So as we are discussing, the POST_WRITE event is not useful when we want to
> > > handle crash safety. And if we have some other mechanism (like SRCU) which
> > > is able to guarantee crash safety, then what is the benefit of POST_WRITE?
> > > I'm not against POST_WRITE, I just don't see much value in it if we have
> > > another mechanism to deal with events straddling checkpoint.
> > >
> >
> > Not sure I follow.
> >
> > I think that crash safety can be achieved also with PRE/POST_WRITE:
> > - PRE_WRITE records an intent to write in persistent snapshot T
> >   and add to in-memory map of in-progress writes of period T
> > - When "checkpoint T" starts, new PRE_WRITES are recorded in both
> >   T and T+1 persistent snapshots, but event is added only to
> >   in-memory map of in-progress writes of period T+1
> > - "checkpoint T" ends when all in-progress writes of T are completed
> >
> > The trick with alternating snapshots "handover" is this
> > (perhaps I never explained it and I need to elaborate on the wiki [1]):
> >
> > [1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#Modified_files_query
> >
> > The changed files query results need to include recorded changes in both
> > "finalizing" snapshot T and the new snapshot T+1 that was started in
> > the beginning of the query.
> >
> > Snapshot T MUST NOT be discarded until checkpoint/handover
> > is complete AND the query results that contain changes recorded
> > in T and T+1 snapshots have been consumed.
> >
> > When the consumer ACKs that the query results have been safely stored
> > or acted upon (I called this operation "bless" snapshot T+1) then and
> > only then can snapshot T be discarded.
> >
> > After snapshot T is discarded a new query will start snapshot T+2.
> > A changed files query result includes the id of the last blessed snapshot.
> >
> > I think this is more or less equivalent to the SRCU that you suggested,
> > but all the work is done in userspace at application level.
> >
> > If you see any problem with this scheme or don't understand it
> > please let me know and I will try to explain better.
> >
> 
> Hmm I guess "crash safety" is not well defined.
> You and I were talking about "system crash" and indeed, this was
> my only concern with kernel implementation of overlayfs watch.
> 
> But with userspace HSM service, how can we guarantee that
> modifications did not happen while the service is down?
> 
> I don't really have a good answer for this.

Very good point!
 
> Thinking out loud, we would somehow need to make the default
> permission deny for all modifications, maybe through some mount
> property (e.g. MOUNT_ATTR_PROT_READ), causing the pre-write
> hooks to default to EROFS if there is no "vfs filter" mount mark.
> 
> Then it will be possible to expose a "safe" mount to users, where
> modifications can never go unnoticed even when HSM service
> crashes.

Yeah, something like this. Although the bootstrap of this during mount may
be a bit challenging. But maybe not.

Also I'm thinking about other use cases - for HSM I agree we essentially
need to take the FS down if the userspace counterpart is not working. What
about other persistent change log use cases? Do we mandate that there is
only one "persistent change log" daemon in the system (or per filesystem?)
and that it must be running or we take the filesystem down? And anybody who
wants reliable notifications needs to consume the service of this daemon?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-11-21 16:40                                     ` Amir Goldstein
@ 2022-11-23 12:11                                       ` Jan Kara
  2022-11-23 13:30                                         ` Amir Goldstein
  0 siblings, 1 reply; 43+ messages in thread
From: Jan Kara @ 2022-11-23 12:11 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Dave Chinner

On Mon 21-11-22 18:40:06, Amir Goldstein wrote:
> On Wed, Nov 16, 2022 at 6:24 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > > > > Can't we introduce some SRCU lock / unlock into
> > > > > file_start_write() / file_end_write() and then invoke synchronize_srcu()
> > > > > during checkpoint after removing ignore marks? It will be much cheaper as
> > > > > we don't have to flush all dirty data to disk and also writes can keep
> > > > > flowing while we wait for outstanding writes straddling checkpoint to
> > > > > complete. What do you think?
> > > >
> > > > Maybe, but this is not enough.
> > > > Note that my patches [1] are overlapping fsnotify_mark_srcu with
> > > > file_start_write(), so we would need to overlay fsnotify_mark_srcu
> > > > with this new "modify SRCU".
> > > >
> > > > [1] https://github.com/amir73il/linux/commits/fanotify_pre_content
> > >
> > > Yes, I know that and frankly, that is what I find somewhat ugly :) I'd rather
> > > have the "modify SRCU" cover the whole region we need - which means
> > > including the generation of PRE_MODIFY event.
> > >
> >
> > Yeh, it would be great if we can pull this off.
> 
> OK. I decided to give this a shot, see:
> 
> https://github.com/amir73il/linux/commits/sb_write_barrier
> 
> It is just a sketch to show the idea, very lightly tested.
> What I did is, instead of converting all the sb_start,end_write()
> call sites, which would be a huge change, only callers of
> sb_start,end_write_srcu() participate in the "modify SRCU".
> 
> I then converted all the dir modify call sites and some other
> call sites to use helpers that take SRCU and call pre-modify hooks.
> 
> [...]

I've glanced over the changes and yes, that's what I was imagining :).

> > > > > The technical problem I see is how to deal with AIO / io_uring because
> > > > > SRCU needs to be released in the same context as it is acquired - that
> > > > > would need to be consulted with Paul McKenney if we can make it work. And
> > > > > another problem I see is that it might not be great to have this
> > > > > system-wide as e.g. on networking filesystems or pipes writes can block for
> > > > > really long.
> > > > >
> 
> I averted this problem for now - file data writes are not covered by
> s_write_srcu with my POC branch.

Since you've made the SRCU per sb there is no problem with writes blocking
too long on some filesystems. I've asked RCU guys about the problem with
SRCU being acquired / released from different contexts. Logically, it seems
it should be possible to make this work but maybe I miss some technical
detail.
 
> The rationale is that with file data write, HSM would anyway need to
> use fsfreeze() to get any guarantee, so maybe s_write_srcu is not really
> useful here??
> 
> It might be useful to use s_write_srcu to cover the pre-modify event
> up to after sb_start_write() in file write/aio/io_uring, but not byond it,
> so that sb_write_barrier()+fsfreeze() will provide full coverage for
> in-progress writes.
> 
> Please let me know if this plan sounds reasonable.

Let's see what RCU guys reply. I'd prefer to cover the whole write for
simplicity if it is reasonably possible.

> > > > > Final question is how to expose this to userspace because this
> > > > > functionality would seem useful outside of filesystem notification space so
> > > > > probably do not need to tie it to that.
> 
> In the current POC branch, nothing calls sb_write_barrier() yet.
> I was thinking of using it when flushing marks, maybe with the
> FAN_MARK_SYNC flag that I proposed.

Yes, that's what I'd imagine as well.

> For general purpose API, the semantics would need to be better
> defined, as with this "opt-in" implementation, only some of the
> modification operations are covered by the 'sb write barrier'.

Yes, for now we can keep things internal to fsnotify.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-11-23 10:49                                       ` Jan Kara
@ 2022-11-23 13:07                                         ` Amir Goldstein
  0 siblings, 0 replies; 43+ messages in thread
From: Amir Goldstein @ 2022-11-23 13:07 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner

> > Hmm I guess "crash safety" is not well defined.
> > You and I were talking about "system crash" and indeed, this was
> > my only concern with kernel implementation of overlayfs watch.
> >
> > But with userspace HSM service, how can we guarantee that
> > modifications did not happen while the service is down?
> >
> > I don't really have a good answer for this.
>
> Very good point!
>
> > Thinking out loud, we would somehow need to make the default
> > permission deny for all modifications, maybe through some mount
> > property (e.g. MOUNT_ATTR_PROT_READ), causing the pre-write
> > hooks to default to EROFS if there is no "vfs filter" mount mark.
> >
> > Then it will be possible to expose a "safe" mount to users, where
> > modifications can never go unnoticed even when HSM service
> > crashes.
>
> Yeah, something like this. Although the bootstrap of this during mount may
> be a bit challenging. But maybe not.
>

I don't think so.
As I wrote on several occasions, some of the current HSMs are implemented
as FUSE filesystems and require a mount.

As I imagine an HSM system (and as our in-house system works),
there is a filesystem containing populated and unpopulated files that the
admin can access without any filters, and there is a mount that is exposed
to users where the filtering and on-demand population happen.

I am less worried about bringup.
My HttpDirFS POC already does a mount move of a marked mount on startup.
My concern was how to handle a dying fanotify group safely.
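
To illustrate the kind of startup sequence I mean - not lifted from the
POC, error handling omitted and all paths made up:

#include <fcntl.h>
#include <sys/fanotify.h>
#include <sys/mount.h>

static int hsm_setup_marked_mount(void)
{
        int fan_fd = fanotify_init(FAN_CLASS_PRE_CONTENT | FAN_CLOEXEC,
                                   O_RDONLY);

        /* Stage the backing tree on a private mount... */
        mount("/srv/hsm-backing", "/run/hsm/staging", NULL, MS_BIND, NULL);
        /* ...mark the staged mount before users can see it... */
        fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_MOUNT, FAN_OPEN_PERM,
                      AT_FDCWD, "/run/hsm/staging");
        /* ...and only then move the marked mount into the user-visible place. */
        mount("/run/hsm/staging", "/export/data", NULL, MS_MOVE, NULL);

        return fan_fd;
}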

> Also I'm thinking about other use cases - for HSM I agree we essentially
> need to take the FS down if the userspace counterpart is not working. What
> about other persistent change log use cases? Do we mandate that there is
> only one "persistent change log" daemon in the system (or per filesystem?)
> and that it must be running or we take the filesystem down? And anybody who
> wants reliable notifications needs to consume the service of this daemon?

Yes, I envision a single systemd-fsmonitor daemon (or one instance per sb)
that can handle subscribing to changes on a subtree and can deal with the
permissions for dispatching events on subtrees.

To answer your question, I think the bare minimum that we need to provide
is a property of the mount (probably an event mask) that requires at least one
active fanotify vfs filter to allow certain permission events to go through.

I think it would make sense to allow a single FAN_CLASS_VFS_FILTER
group mark per sb and one per mount.

If use cases that require more vfs filters per sb/mount arise, we can
revisit that restriction later.
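
To make that concrete, setting up such a group could look roughly like
this (FAN_CLASS_VFS_FILTER, FAN_PRE_ACCESS and FAN_PRE_MODIFY are only
the proposed names - none of this exists in the uapi today):

int fd = fanotify_init(FAN_CLASS_VFS_FILTER | FAN_REPORT_DFID_NAME, O_RDONLY);

/* The single allowed vfs filter mark for this sb; without an active
 * filter, the mount property above would make the pre-write hooks
 * default to EROFS.
 */
fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
              FAN_PRE_ACCESS | FAN_PRE_MODIFY, AT_FDCWD, "/export/data");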

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-11-23 12:11                                       ` Jan Kara
@ 2022-11-23 13:30                                         ` Amir Goldstein
  0 siblings, 0 replies; 43+ messages in thread
From: Amir Goldstein @ 2022-11-23 13:30 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner

On Wed, Nov 23, 2022 at 2:11 PM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 21-11-22 18:40:06, Amir Goldstein wrote:
> > On Wed, Nov 16, 2022 at 6:24 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > > > > Can't we introduce some SRCU lock / unlock into
> > > > > > file_start_write() / file_end_write() and then invoke synchronize_srcu()
> > > > > > during checkpoint after removing ignore marks? It will be much cheaper as
> > > > > > we don't have to flush all dirty data to disk and also writes can keep
> > > > > > flowing while we wait for outstanding writes straddling checkpoint to
> > > > > > complete. What do you think?
> > > > >
> > > > > Maybe, but this is not enough.
> > > > > Note that my patches [1] are overlapping fsnotify_mark_srcu with
> > > > > file_start_write(), so we would need to overlay fsnotify_mark_srcu
> > > > > with this new "modify SRCU".
> > > > >
> > > > > [1] https://github.com/amir73il/linux/commits/fanotify_pre_content
> > > >
> > > > Yes, I know that and frankly, that is what I find somewhat ugly :) I'd rather
> > > > have the "modify SRCU" cover the whole region we need - which means
> > > > including the generation of PRE_MODIFY event.
> > > >
> > >
> > > Yeh, it would be great if we can pull this off.
> >
> > OK. I decided to give this a shot, see:
> >
> > https://github.com/amir73il/linux/commits/sb_write_barrier
> >
> > It is just a sketch to show the idea, very lightly tested.
> > What I did is, instead of converting all the sb_start,end_write()
> > call sites, which would be a huge change, only callers of
> > sb_start,end_write_srcu() participate in the "modify SRCU".
> >
> > I then converted all the dir modify call sites and some other
> > call sites to use helpers that take SRCU and call pre-modify hooks.
> >
> > [...]
>
> I've glanced over the changes and yes, that's what I was imagining :).

FYI, I just pushed an update to that branch which includes *write_srcu
wrappers for all the call sites of security_file_permission(file, MAY_WRITE),
mostly file_start_write_area()/file_end_write_srcu().

For async writes, write_srcu only covers the pre-modify event and the
I/O submission.

>
> > > > > > The technical problem I see is how to deal with AIO / io_uring because
> > > > > > SRCU needs to be released in the same context as it is acquired - that
> > > > > > would need to be consulted with Paul McKenney if we can make it work. And
> > > > > > another problem I see is that it might not be great to have this
> > > > > > system-wide as e.g. on networking filesystems or pipes writes can block for
> > > > > > really long.
> > > > > >
> >
> > I averted this problem for now - file data writes are not covered by
> > s_write_srcu with my POC branch.
>
> Since you've made the SRCU per sb there is no problem with writes blocking
> too long on some filesystems. I've asked RCU guys about the problem with
> SRCU being acquired / released from different contexts. Logically, it seems
> it should be possible to make this work but maybe I miss some technical
> detail.
>
> > The rationale is that with file data write, HSM would anyway need to
> > use fsfreeze() to get any guarantee, so maybe s_write_srcu is not really
> > useful here??
> >

That was a wrong statement. For data writes, an HSM would need to do
syncfs() after sb_write_barrier(), not fsfreeze.
With the current POC branch, fsfreeze would only be needed to wait
for the async write completions.
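
So a data-write checkpoint in the HSM would look roughly like this
(FAN_MARK_SYNC and the sb_write_barrier() it would trigger are only
proposals, not existing uapi, and the path is made up):

/* Drop the per-inode ignore marks; with the proposed FAN_MARK_SYNC the
 * call would also issue sb_write_barrier() for the filesystem at the
 * given path and return only after in-flight writes that already
 * passed their pre-modify hook have drained.
 */
fanotify_mark(fan_fd, FAN_MARK_FLUSH | FAN_MARK_SYNC,
              0, AT_FDCWD, "/export/data");

/* Now make the completed data writes durable. */
int dfd = open("/export/data", O_RDONLY | O_DIRECTORY);
syncfs(dfd);
close(dfd);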

> > It might be useful to use s_write_srcu to cover the pre-modify event
> > up to after sb_start_write() in file write/aio/io_uring, but not byond it,
> > so that sb_write_barrier()+fsfreeze() will provide full coverage for
> > in-progress writes.
> >
> > Please let me know if this plan sounds reasonable.
>
> Let's see what RCU guys reply. I'd prefer to cover the whole write for
> simplicity if it is reasonably possible.
>

OK.

Other backup plans:
- Set a flag in pre-modify events from async writes so the HSM
  knows that the event is not "atomic" with the modification
- The HSM may deny all events of this sort if it needs to record
  a change, or it may mark the snapshot as "requires freeze"
- Reliably deliver post-write events only for the async write
  completions (also on error), so the HSM can track only those
  event pairs

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: thoughts about fanotify and HSM
  2022-11-23 10:10                                     ` Jan Kara
@ 2022-11-23 15:16                                       ` Amir Goldstein
  0 siblings, 0 replies; 43+ messages in thread
From: Amir Goldstein @ 2022-11-23 15:16 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner

On Wed, Nov 23, 2022 at 12:10 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 16-11-22 18:24:06, Amir Goldstein wrote:
> > > > Why then give up on the POST_WRITE events idea?
> > > > Don't you think it could work?
> > >
> > > So as we are discussing, the POST_WRITE event is not useful when we want to
> > > handle crash safety. And if we have some other mechanism (like SRCU) which
> > > is able to guarantee crash safety, then what is the benefit of POST_WRITE?
> > > I'm not against POST_WRITE, I just don't see much value in it if we have
> > > another mechanism to deal with events straddling checkpoint.
> > >
> >
> > Not sure I follow.
> >
> > I think that crash safety can be achieved also with PRE/POST_WRITE:
> > - PRE_WRITE records an intent to write in persistent snapshot T
> >   and add to in-memory map of in-progress writes of period T
> > - When "checkpoint T" starts, new PRE_WRITES are recorded in both
> >   T and T+1 persistent snapshots, but event is added only to
> >   in-memory map of in-progress writes of period T+1
> > - "checkpoint T" ends when all in-progress writes of T are completed
>
> So maybe I miss something, but suppose the situation I was mentioning a few
> emails earlier:
>
> PRE_WRITE for F                 -> F recorded as modified in T
> modify F
> POST_WRITE for F
>
> PRE_WRITE for F                 -> ignored because F is already marked as
>                                    modified
>
>                                 -> checkpoint T requested, modified files
>                                    reported, process modified files
> modify F
> --------- crash
>
> Now unless filesystem freeze or SRCU is part of checkpoint, we will never
> notify about the last modification to F. So I don't see how PRE +
> POST_WRITE alone can achieve crash safety...
>
> And if we use filesystem freeze or SRCU as part of checkpoint, then
> processing of POST_WRITE events does not give us anything new. E.g.
> synchronize_srcu() during checkpoint before handing out the list of modified
> files makes sure all modifications to files for which PRE_MODIFY events
> were generated (and thus are listed as modified in checkpoint T) are
> visible for userspace.
>
> So am I missing some case where POST_WRITE would be more useful than SRCU?
> Because at this point I'd rather implement SRCU than POST_WRITE.
>

I tend to agree. Even if POST_WRITE can be done,
SRCU will be far better.

> > The trick with alternating snapshots "handover" is this
> > (perhaps I never explained it and I need to elaborate on the wiki [1]):
> >
> > [1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#Modified_files_query
> >
> > The changed files query results need to include recorded changes in both
> > "finalizing" snapshot T and the new snapshot T+1 that was started in
> > the beginning of the query.
> >
> > Snapshot T MUST NOT be discarded until checkpoint/handover
> > is complete AND the query results that contain changes recorded
> > in T and T+1 snapshots have been consumed.
> >
> > When the consumer ACKs that the query results have been safely stored
> > or acted upon (I called this operation "bless" snapshot T+1) then and
> > only then can snapshot T be discarded.
> >
> > After snapshot T is discarded a new query will start snapshot T+2.
> > A changed files query result includes the id of the last blessed snapshot.
> >
> > I think this is more or less equivalent to the SRCU that you suggested,
> > but all the work is done in userspace at application level.
> >
> > If you see any problem with this scheme or don't understand it
> > please let me know and I will try to explain better.
>
> So until now I was imagining that query results will be returned like one
> big memcpy. I.e. a one-off event where the "persistent log daemon" hands over
> the whole contents of checkpoint T to the client. Whatever happens with the
> returned data is the business of the client, whatever happens with the
> checkpoint T records in the daemon is the daemon's business. The model you
> seem to speak about here is somewhat different - more like readdir() kind
> of approach where client asks for access to checkpoint T data, daemon
> provides the data record by record (probably serving the data from its
> files on disk), and when the client is done and "closes" checkpoint T,
> daemon's records about checkpoint T can be erased. Am I getting it right?
>

Yes, something like that.
The query result (which is actually a recursive readdir) could be huge.
So it cannot really be returned as a blob; it must be streamed to consumers.
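
I.e. the consumer side is just an iteration, something like this
(purely illustrative - none of these calls exist anywhere):

struct changelog_cursor *cur = changelog_open(daemon, T); /* invented API */
struct changelog_record *rec;

while ((rec = changelog_next(cur)))     /* daemon streams record by record */
        handle_modified_file(rec);

changelog_close(cur);                   /* daemon may now discard checkpoint T */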

> This however seems somewhat orthogonal to the SRCU idea. SRCU essentially
> serves only one purpose - to make sure that modifications to all files for
> which we have received a PRE_WRITE event are visible in the respective files.
>

Absolutely right.
Sorry for the noise, but at least you've learned one more thing
about my persistent change snapshots architecture ;-)

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2022-11-23 15:16 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-11 18:12 thoughts about fanotify and HSM Amir Goldstein
2022-09-12 12:57 ` Jan Kara
2022-09-12 16:38   ` Amir Goldstein
     [not found]     ` <BY5PR07MB652953061D3A2243F66F0798A3449@BY5PR07MB6529.namprd07.prod.outlook.com>
2022-09-13  2:41       ` Amir Goldstein
2022-09-14  7:27     ` Amir Goldstein
2022-09-14 10:30       ` Jan Kara
2022-09-14 11:52         ` Amir Goldstein
2022-09-20 18:19           ` Amir Goldstein
2022-09-22 10:48             ` Jan Kara
2022-09-22 13:03               ` Amir Goldstein
2022-09-26 15:27                 ` Jan Kara
2022-09-28 12:29                   ` Amir Goldstein
2022-09-29 10:01                     ` Jan Kara
2022-10-07 13:58                       ` Amir Goldstein
2022-10-12 15:44                         ` Jan Kara
2022-10-12 16:28                           ` Amir Goldstein
2022-10-13 12:16                             ` Amir Goldstein
2022-11-03 12:57                               ` Jan Kara
2022-11-03 13:38                                 ` Amir Goldstein
2022-10-28 12:50               ` Amir Goldstein
2022-11-03 16:30                 ` Jan Kara
2022-11-04  8:17                   ` Amir Goldstein
2022-11-07 11:10                     ` Jan Kara
2022-11-07 14:13                       ` Amir Goldstein
2022-11-14 19:17                         ` Jan Kara
2022-11-14 20:08                           ` Amir Goldstein
2022-11-15 10:16                             ` Jan Kara
2022-11-15 13:08                               ` Amir Goldstein
2022-11-16 10:56                                 ` Jan Kara
2022-11-16 16:24                                   ` Amir Goldstein
2022-11-17 12:38                                     ` Amir Goldstein
2022-11-23 10:49                                       ` Jan Kara
2022-11-23 13:07                                         ` Amir Goldstein
2022-11-21 16:40                                     ` Amir Goldstein
2022-11-23 12:11                                       ` Jan Kara
2022-11-23 13:30                                         ` Amir Goldstein
2022-11-23 10:10                                     ` Jan Kara
2022-11-23 15:16                                       ` Amir Goldstein
     [not found]     ` <BY5PR07MB6529795F49FB4E923AFCB062A3449@BY5PR07MB6529.namprd07.prod.outlook.com>
2022-09-14  9:29       ` Jan Kara
2022-09-21 23:27 ` Dave Chinner
2022-09-22  4:35   ` Amir Goldstein
2022-09-23  7:57     ` Dave Chinner
2022-09-23 11:22       ` Amir Goldstein
