* thoughts about fanotify and HSM @ 2022-09-11 18:12 Amir Goldstein 2022-09-12 12:57 ` Jan Kara 2022-09-21 23:27 ` Dave Chinner 0 siblings, 2 replies; 43+ messages in thread From: Amir Goldstein @ 2022-09-11 18:12 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel Hi Jan, I wanted to consult with you about preliminary design thoughts for implementing a hierarchical storage manager (HSM) with fanotify. I have been in contact with some developers in the past who were interested in using fanotify to implement HSM (to replace the old DMAPI implementation). Basically, FAN_OPEN_PERM + FAN_MARK_FILESYSTEM should be enough to implement a basic HSM, but it is not sufficient for implementing more advanced HSM features. Some of the HSM features that I would like are: - blocking hook before access to a file range, to fill that range - blocking hook before lookup of a child, to optionally create that child My thoughts on the UAPI were: - Allow a new combination of FAN_CLASS_PRE_CONTENT and FAN_REPORT_FID/DFID_NAME - This combination does not allow any of the existing events in the mask - It allows only new events such as FAN_PRE_ACCESS, FAN_PRE_MODIFY and FAN_PRE_LOOKUP - FAN_PRE_ACCESS and FAN_PRE_MODIFY can have optional file range info - All the FAN_PRE_ events are called outside vfs locks, and specifically before the sb_writers lock, as in my fsnotify_pre_modify [1] POC That last part is important because the HSM daemon will need to make modifications to the accessed file/directory before allowing the operation to proceed. Naturally that opens the possibility for new userspace deadlocks. Nothing that is not already possible with permission events, but maybe deadlocks that are more inviting to trip over. I am not sure if we need to do anything about this, but we could make it easier to ignore events from the HSM daemon itself if we want to, to make the userspace implementation easier.
Another thing that might be good to do is provide an administrative interface to iterate and abort pending fanotify permission/pre-content events. You must have noticed the overlap between my old persistent change tracking journal and this design. The referenced branch is from that old POC. I do believe that the use cases somewhat overlap and that the same building blocks could be used to implement a persistent change journal in userspace as you suggested back then. Thoughts? Amir. [1] https://github.com/amir73il/linux/commits/fsnotify_pre_modify ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-09-11 18:12 thoughts about fanotify and HSM Amir Goldstein @ 2022-09-12 12:57 ` Jan Kara 2022-09-12 16:38 ` Amir Goldstein 2022-09-21 23:27 ` Dave Chinner 1 sibling, 1 reply; 43+ messages in thread From: Jan Kara @ 2022-09-12 12:57 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel Hi Amir! On Sun 11-09-22 21:12:06, Amir Goldstein wrote: > I wanted to consult with you about preliminary design thoughts > for implementing a hierarchical storage manager (HSM) > with fanotify. > > I have been in contact with some developers in the past > who were interested in using fanotify to implement HSM > (to replace old DMAPI implementation). Ah, DMAPI. Shiver. Bad memories of carrying that hacky code in SUSE kernels ;) So how serious are these guys about HSM and investing into it? Because the kernel is going to be only a small part of what's needed for it to be useful, and we've dropped DMAPI from SUSE kernels because the code was painful to carry (and forward-port + it was not of great quality) and the demand for it was not really big... So I'd prefer to avoid the major API extension unless there are serious users out there - perhaps we will even need to develop the kernel API in cooperation with the userspace part to verify the result is actually usable and useful. But for now we can take it as an interesting mental exercise ;) > Basically, FAN_OPEN_PERM + FAN_MARK_FILESYSTEM > should be enough to implement a basic HSM, but it is not > sufficient for implementing more advanced HSM features.
> > Some of the HSM feature that I would like are: > - blocking hook before access to file range and fill that range > - blocking hook before lookup of child and optionally create child > > My thoughts on the UAPI were: > - Allow new combination of FAN_CLASS_PRE_CONTENT > and FAN_REPORT_FID/DFID_NAME > - This combination does not allow any of the existing events > in mask > - It Allows only new events such as FAN_PRE_ACCESS > FAN_PRE_MODIFY and FAN_PRE_LOOKUP > - FAN_PRE_ACCESS and FAN_PRE_MODIFY can have > optional file range info > - All the FAN_PRE_ events are called outside vfs locks and > specifically before sb_writers lock as in my fsnotify_pre_modify [1] > POC > > That last part is important because the HSM daemon will > need to make modifications to the accessed file/directory > before allowing the operation to proceed. My main worry here would be that with FAN_FILESYSTEM marks, there will be far too many events (especially for the lookup & access cases) to reasonably process. And since the events will be blocking, the impact on performance will be large. I think that a reasonably efficient HSM will have to stay in the kernel (without generating work for userspace) for the "nothing to do" case. And only in case something needs to be migrated, an event is generated and userspace gets involved. But it isn't obvious to me how to do this with fanotify (I could imagine it with, say, overlayfs which is kind of an HSM solution already ;)).
So if the events happen only in the "migration needed" case, I don't think deadlocks would be too problematic - it just requires a bit of care from userspace so that the event processing & migration processes do not access HSM managed stuff. > Another thing that might be good to do is provide an administrative > interface to iterate and abort pending fanotify permission/pre-content > events. You can always kill the listener. Or are you worried about cases where it sleeps in UN state? > You must have noticed the overlap between my old persistent > change tracking journal and this design. The referenced branch > is from that old POC. > > I do believe that the use cases somewhat overlap and that the > same building blocks could be used to implement a persistent > change journal in userspace as you suggested back then. > > Thoughts? Yes, there is some overlap. But OTOH HSM seems to require more detailed and generally more frequent events which seems like a challenge. > [1] https://github.com/amir73il/linux/commits/fsnotify_pre_modify Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-09-12 12:57 ` Jan Kara @ 2022-09-12 16:38 ` Amir Goldstein [not found] ` <BY5PR07MB652953061D3A2243F66F0798A3449@BY5PR07MB6529.namprd07.prod.outlook.com> ` (2 more replies) 0 siblings, 3 replies; 43+ messages in thread From: Amir Goldstein @ 2022-09-12 16:38 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, Miklos Szeredi On Mon, Sep 12, 2022 at 3:57 PM Jan Kara <jack@suse.cz> wrote: > > Hi Amir! > > On Sun 11-09-22 21:12:06, Amir Goldstein wrote: > > I wanted to consult with you about preliminary design thoughts > > for implementing a hierarchical storage manager (HSM) > > with fanotify. > > > > I have been in contact with some developers in the past > > who were interested in using fanotify to implement HSM > > (to replace old DMAPI implementation). > > Ah, DMAPI. Shiver. Bad memories of carrying that hacky code in SUSE kernels > ;) > > So how serious are these guys about HSM and investing into it? Because Let's put it this way. They had to find a replacement for DMAPI so that they could stop carrying DMAPI patches, so pretty serious. They had to do it one way or the other. They approached me around the time that FAN_MARK_FILESYSTEM was merged, so I explained to them how to implement HSM using FAN_MARK_FILESYSTEM+FAN_OPEN_PERM. Whether they ended up using it or not - I don't know. But I do know for a fact that there are several companies out there implementing HSM to tier local storage to cloud and CTERA is one of those companies. We use FUSE to implement HSM and I have reason to believe that other companies do that as well. > kernel is going to be only a small part of what's needed for it to be > useful and we've dropped DMAPI from SUSE kernels because the code was > painful to carry (and forwardport + it was not of great quality) and the > demand for it was not really big...
So I'd prefer to avoid the major API > extension unless there are serious users out there - perhaps we will even > need to develop the kernel API in cooperation with the userspace part to > verify the result is actually usable and useful. But for now we can take it > as an interesting mental excercise ;) > Certainly. Let's make this a "Call for Users". > > Basically, FAN_OPEN_PERM + FAN_MARK_FILESYSTEM > > should be enough to implement a basic HSM, but it is not > > sufficient for implementing more advanced HSM features. > > > > Some of the HSM feature that I would like are: > > - blocking hook before access to file range and fill that range > > - blocking hook before lookup of child and optionally create child > > > > My thoughts on the UAPI were: > > - Allow new combination of FAN_CLASS_PRE_CONTENT > > and FAN_REPORT_FID/DFID_NAME > > - This combination does not allow any of the existing events > > in mask > > - It Allows only new events such as FAN_PRE_ACCESS > > FAN_PRE_MODIFY and FAN_PRE_LOOKUP > > - FAN_PRE_ACCESS and FAN_PRE_MODIFY can have > > optional file range info > > - All the FAN_PRE_ events are called outside vfs locks and > > specifically before sb_writers lock as in my fsnotify_pre_modify [1] > > POC > > > > That last part is important because the HSM daemon will > > need to make modifications to the accessed file/directory > > before allowing the operation to proceed. > > My main worry here would be that with FAN_FILESYSTEM marks, there will be > far to many events (especially for the lookup & access cases) to reasonably > process. And since the events will be blocking, the impact on performance > will be large. > Right. That problem needs to be addressed. > I think that a reasonably efficient HSM will have to stay in the kernel > (without generating work for userspace) for the "nothing to do" case. And > only in case something needs to be migrated, event is generated and > userspace gets involved. 
But it isn't obvious to me how to do this with > fanotify (I could imagine it with say overlayfs which is kind of HSM > solution already ;)). > Yeah, overlayfs was on my radar for this as well ;) as well as eBPF-FUSE: https://lore.kernel.org/linux-fsdevel/CAOQ4uxh2OZ_AMp6XRcMy0ZtjkQnBfBZFhH0t-+Pd298uPuSEVw@mail.gmail.com/ There is an existing HSM solution out there that this proposal is modeled after: https://docs.microsoft.com/en-us/windows/win32/projfs/projected-file-system The Windows ProjFS hooks into userspace for specific files that were marked in some way. There was an attempt to implement this interface for Linux but the project has been discontinued: https://github.com/github/libprojfs/ To do the same thing with fanotify we would either need to define a new inode flag (i.e. chattr) for this purpose or (much better) implement an eBPF filter for FAN_MARK_FILESYSTEM, so that every HSM can choose whichever inode flag it wants.
> FUSE, which is technically also a userspace hook for filesystem operations, can end up in a state where the daemon cannot be killed to release a deadlock, which is why the administrative abort interface is needed: https://www.kernel.org/doc/html/latest/filesystems/fuse.html#aborting-a-filesystem-connection I am not sure if we need to worry about this if the fanotify hooks are called outside of vfs locks (?). > > You must have noticed the overlap between my old persistent > > change tracking journal and this design. The referenced branch > > is from that old POC. > > > > I do believe that the use cases somewhat overlap and that the > > same building blocks could be used to implement a persistent > > change journal in userspace as you suggested back then. > > > > Thoughts? > > Yes, there is some overlap. But OTOH HSM seems to require more detailed and > generally more frequent events which seems like a challenge. > I should have mentioned this in my proposal, but you are right - kernel filtering is an essential part of implementing HSM using fanotify. The "swap in" should actually be an uncommon case; after it happens on a certain file, that file becomes "warm", and "swap out" of that file would then be discouraged. Thanks, Amir. ^ permalink raw reply [flat|nested] 43+ messages in thread
[parent not found: <BY5PR07MB652953061D3A2243F66F0798A3449@BY5PR07MB6529.namprd07.prod.outlook.com>]
* Re: thoughts about fanotify and HSM [not found] ` <BY5PR07MB652953061D3A2243F66F0798A3449@BY5PR07MB6529.namprd07.prod.outlook.com> @ 2022-09-13 2:41 ` Amir Goldstein 0 siblings, 0 replies; 43+ messages in thread From: Amir Goldstein @ 2022-09-13 2:41 UTC (permalink / raw) To: Plaster, Robert; +Cc: Jan Kara, linux-fsdevel, Miklos Szeredi > From: Amir Goldstein <amir73il@gmail.com> > Date: Monday, September 12, 2022 at 9:38 AM > To: Jan Kara <jack@suse.cz> > Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>, Miklos Szeredi <miklos@szeredi.hu> > Subject: Re: thoughts about fanotify and HSM > > On Mon, Sep 12, 2022 at 3:57 PM Jan Kara <jack@suse.cz> wrote: > > > > Hi Amir! > > > > On Sun 11-09-22 21:12:06, Amir Goldstein wrote: > > > I wanted to consult with you about preliminary design thoughts > > > for implementing a hierarchical storage manager (HSM) > > > with fanotify. > > > > > > I have been in contact with some developers in the past > > > who were interested in using fanotify to implement HSM > > > (to replace old DMAPI implementation). > > > > Ah, DMAPI. Shiver. Bad memories of carrying that hacky code in SUSE kernels > > ;) > > > > So how serious are these guys about HSM and investing into it? Because > > Let's put it this way. > They had to find a replacement for DMAPI so that they could stop > carrying DMAPI patches, so pretty serious. > They had to do it one way or the other. > > They approached me around the time that FAN_MARK_FILESYSTEM > was merged, so I explained them how to implement HSM using > FAN_MARK_FILESYSTEM+FAN_OPEN_PERM > Whether they ended up using it or not - I don't know. > On Tue, Sep 13, 2022 at 2:01 AM Plaster, Robert <rplaster@deepspacestorage.com> wrote: > > Hi Amir – Dan got back to me. He said (fyi - SMS referenced below is our HSM app): > Hi Rob, I will add that from what I read on your website [1], your entire product is open source, code is available on your web servers as rpms and will be put up on GitHub soon. 
That's very good news, because it means I will be able to demo my proposed fanotify interface improvements on your code base :) [1] https://deepspacestorage.com/resources/#downloads > > > “Amir talks about specific fanotify events used for an HSM. He says FAN_MARK_FILESYSTEM+FAN_OPEN_PERM should be enough for a basic HSM. As it is currently implemented, SMS uses FAN_MARK_ADD+FAN_OPEN_PERM to detect purged files and FAN_MARK_ADD+FAN_CLOSE_WRITE events to determine when a file has been potentially modified. There are other FANOTIFY events that would be useful, but we're currently limited by the older Linux kernels in the RHEL releases we're supporting. > > > > If I understand what Amil is proposing, it appears to be some new FANOTIFY FAN_PRE_* events. Some of it looks like something we would be interested in but as long as we continue to support older RHEL kernels, we're very limited to what we have to work with.” > My goal is to design the interfaces for the use of future more advanced HSM clients. The experience that you can bring to the table from customers using your current HSM client is very important for making design choices for future HSM clients - i.e. understand and address the pain points with current fanotify interface. So as long as this is "something that you would be interested in" I know I am in the right direction ;-) Thanks, Amir. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-09-12 16:38 ` Amir Goldstein [not found] ` <BY5PR07MB652953061D3A2243F66F0798A3449@BY5PR07MB6529.namprd07.prod.outlook.com> @ 2022-09-14 7:27 ` Amir Goldstein 2022-09-14 10:30 ` Jan Kara [not found] ` <BY5PR07MB6529795F49FB4E923AFCB062A3449@BY5PR07MB6529.namprd07.prod.outlook.com> 2 siblings, 1 reply; 43+ messages in thread From: Amir Goldstein @ 2022-09-14 7:27 UTC (permalink / raw) To: Jan Kara, Miklos Szeredi; +Cc: Plaster, Robert, David Howells, linux-fsdevel On Mon, Sep 12, 2022 at 7:38 PM Amir Goldstein <amir73il@gmail.com> wrote: > > On Mon, Sep 12, 2022 at 3:57 PM Jan Kara <jack@suse.cz> wrote: > > > > Hi Amir! > > > > On Sun 11-09-22 21:12:06, Amir Goldstein wrote: > > > I wanted to consult with you about preliminary design thoughts > > > for implementing a hierarchical storage manager (HSM) > > > with fanotify. > > > I feel that the discussion is losing focus, so let me try to refocus and list the pros and cons of different options for an HSM API... > > > I have been in contact with some developers in the past > > > who were interested in using fanotify to implement HSM > > > (to replace old DMAPI implementation). > > > > Ah, DMAPI. Shiver. Bad memories of carrying that hacky code in SUSE kernels > > ;) For the record, DMAPI is still partly supported on some proprietary filesystems, but even if a full implementation existed, this old API, which was mostly used for tape devices, is not a good fit for modern-day cloud storage use cases. > > > > So how serious are these guys about HSM and investing into it? > > Let's put it this way. > They had to find a replacement for DMAPI so that they could stop > carrying DMAPI patches, so pretty serious. > They had to do it one way or the other. > As mentioned earlier, this is an open source HSM project [1] with a release coming soon that is using FAN_OPEN_PERM to migrate data from the slower tier.
As you can imagine, FAN_OPEN_PERM can only get you as far as DMAPI but not beyond and it leaves the problem of setting the marks on all punched files on bringup. > > But I do know for a fact that there are several companies out there > implementing HSM to tier local storage to cloud and CTERA is one of > those companies. > > We use FUSE to implement HSM and I have reason to believe that > other companies do that as well. > FUSE is the most flexible API to implement HSM, but it suffers from performance overhead in the "fast" path due to context switches and cache line bounces. FUSE_PASSTHROUGH patches [2] address this overhead for large files IO. I plan to upstream those patches. FUSE-BPF [3] and former extFUSE [4] projects aim to address this overhead for readdir and other operations. This is an alluring option for companies that already use FUSE for HSM, because they will not need to change their implementation much, but my gut feeling is that there are interesting corner cases lurking... > > kernel is going to be only a small part of what's needed for it to be > > useful and we've dropped DMAPI from SUSE kernels because the code was > > painful to carry (and forwardport + it was not of great quality) and the > > demand for it was not really big... Note that the demand was not big for the crappy DMAPI ;) it does not say anything about the demand for HSM solutions, which exists and is growing IMO. > > So I'd prefer to avoid the major API > > extension unless there are serious users out there - perhaps we will even > > need to develop the kernel API in cooperation with the userspace part to > > verify the result is actually usable and useful. Yap. It should be trivial to implement a "mirror" HSM backend. For example, the libprojfs [5] projects implements a MirrorProvider backend for the Microsoft ProjFS [6] HSM API. 
> > > > Basically, FAN_OPEN_PERM + FAN_MARK_FILESYSTEM > > > should be enough to implement a basic HSM, but it is not > > > sufficient for implementing more advanced HSM features. > > > [...] > > My main worry here would be that with FAN_FILESYSTEM marks, there will be > > far to many events (especially for the lookup & access cases) to reasonably > > process. And since the events will be blocking, the impact on performance > > will be large. > > > > Right. That problem needs to be addressed. > > > I think that a reasonably efficient HSM will have to stay in the kernel > > (without generating work for userspace) for the "nothing to do" case. And > > only in case something needs to be migrated, event is generated and > > userspace gets involved. But it isn't obvious to me how to do this with > > fanotify (I could imagine it with say overlayfs which is kind of HSM > > solution already ;)). > > It's true, overlayfs is kind of an HSM, but: - Without swap out to the slower tier - Without user control over the method of swap in from the slower tier On another thread regarding FUSE-BPF, Miklos also mentioned the option to add those features to overlayfs [7] to make it useful as an HSM kernel driver. So we have at least three options for an HSM kernel driver (FUSE, fanotify, overlayfs), but none of them is yet fully equipped to drive a modern HSM implementation. What is clear is that: 1. The fast path must not context switch to userspace 2. The slow path needs an API for calling into user to migrate files/dirs What is not clear is: 1. The method to persistently mark files/dirs for fast/slow path 2. The API to call into userspace Overlayfs provides a method to mark files for the slow path ('trusted.overlay.metacopy' xattr), meaning a file that has metadata but not the data, but overlayfs does not provide the API to perform "user controlled migration" of the data.
Instead of inventing a new API for that, I'd rather extend the known fanotify protocol and allow the new FAN_XXX_PRE events only on filesystems that have the concept of a file without its content (a.k.a. metacopy). We could say that filesystems that support fscache can also support FAN_XXX_PRE events, and perhaps cachefilesd could make use of hooks to implement user modules that populate the fscache objects out of band. There is the naive approach to interpret a "punched hole" in a file as "no content" as DMAPI did, to support FAN_XXX_PRE events on standard local filesystems (fscache does that internally). That would be an opt-in via a fanotify_init() flag and could be useful for old DMAPI HSM implementations that are converted to use the new API. Practically, the filesystems that allow FAN_XXX_PRE events on punched files would need to advertise this support and maintain an inode flag (i.e. I_NODATA) to avoid a performance penalty on every file access. If we take that route, though, it might be better to let the HSM daemon set this flag explicitly (e.g. chattr +X) when punching holes in files and remove the flag explicitly when filling the holes. And there is the most flexible option of attaching a BPF filter to a filesystem mark, but I am afraid that this program will be limited to using information already in the path/dentry/inode struct. At least HSM could use an existing arbitrary inode flag (e.g. chattr+i) as "persistent marks". So many options! I don't know which to choose :) If this plan sounds reasonable, I can start with a POC of "user controlled copy up/down" for overlayfs, using fanotify as the user notification protocol and see where it goes from there. Thanks for reading my brain dump ;) Amir.
[1] https://deepspacestorage.com/ [2] https://lore.kernel.org/linux-fsdevel/20210125153057.3623715-1-balsini@android.com/ [3] https://lpc.events/event/16/contributions/1339/attachments/945/1861/LPC2022%20Fuse-bpf.pdf [4] https://github.com/extfuse/extfuse [5] https://github.com/github/libprojfs [6] https://docs.microsoft.com/en-us/windows/win32/api/_projfs/ [7] https://lore.kernel.org/linux-fsdevel/CAJfpegt4N2nmCQGmLSBB--NzuSSsO6Z0sue27biQd4aiSwvNFw@mail.gmail.com/ ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-09-14 7:27 ` Amir Goldstein @ 2022-09-14 10:30 ` Jan Kara 2022-09-14 11:52 ` Amir Goldstein 0 siblings, 1 reply; 43+ messages in thread From: Jan Kara @ 2022-09-14 10:30 UTC (permalink / raw) To: Amir Goldstein Cc: Jan Kara, Miklos Szeredi, Plaster, Robert, David Howells, linux-fsdevel On Wed 14-09-22 10:27:48, Amir Goldstein wrote: > On Mon, Sep 12, 2022 at 7:38 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > I have been in contact with some developers in the past > > > > who were interested in using fanotify to implement HSM > > > > (to replace old DMAPI implementation). > > > > > > Ah, DMAPI. Shiver. Bad memories of carrying that hacky code in SUSE kernels > > > ;) > > For the record, DMAPI is still partly supported on some proprietary > filesystems, but even if a full implementation existed, this old API > which was used for tape devices mostly is not a good fit for modern > day cloud storage use cases. Interesting, I didn't know DMAPI still lives :) > > > So how serious are these guys about HSM and investing into it? > > > > Let's put it this way. > > They had to find a replacement for DMAPI so that they could stop > > carrying DMAPI patches, so pretty serious. > > They had to do it one way or the other. > > > > As mentioned earlier, this is an open source HSM project [1] > with a release coming soon that is using FAN_OPEN_PERM > to migrate data from the slower tier. > > As you can imagine, FAN_OPEN_PERM can only get you as far > as DMAPI but not beyond and it leaves the problem of setting the > marks on all punched files on bringup. 
Sure I can see that FAN_OPEN_PERM works for basic usage but it certainly leaves room for improvement :) <snip nice summary of FUSE options> > > > So I'd prefer to avoid the major API > > > extension unless there are serious users out there - perhaps we will even > > > need to develop the kernel API in cooperation with the userspace part to > > > verify the result is actually usable and useful. > > Yap. It should be trivial to implement a "mirror" HSM backend. > For example, the libprojfs [5] projects implements a MirrorProvider > backend for the Microsoft ProjFS [6] HSM API. Well, validating that things work using some simple backend is one thing but we are probably also interested in whether the result is practical to use - i.e., whether the performance meets the needs, whether the API is not cumbersome for what HSM solutions need to do, whether the more advanced features like range-support are useful the way they are implemented etc. We can verify some of these things with simple mirror HSM backend but I'm afraid some of the problems may become apparent only once someone actually uses the result in practice and for that we need a userspace counterpart that does actually something useful so that people have motivation to use it :). > > > > Basically, FAN_OPEN_PERM + FAN_MARK_FILESYSTEM > > > > should be enough to implement a basic HSM, but it is not > > > > sufficient for implementing more advanced HSM features. > > > > > [...] > > > My main worry here would be that with FAN_FILESYSTEM marks, there will be > > > far to many events (especially for the lookup & access cases) to reasonably > > > process. And since the events will be blocking, the impact on performance > > > will be large. > > > > > > > Right. That problem needs to be addressed. > > > > > I think that a reasonably efficient HSM will have to stay in the kernel > > > (without generating work for userspace) for the "nothing to do" case. 
And > > > only in case something needs to be migrated, event is generated and > > > userspace gets involved. But it isn't obvious to me how to do this with > > > fanotify (I could imagine it with say overlayfs which is kind of HSM > > > solution already ;)). > > > > > It's true, overlayfs is kind of HSM, but: > - Without swap out to slower tier > - Without user control over method of swap in from slower tier > > On another thread regarding FUSE-BPF, Miklos also mentioned > the option to add those features to overlayfs [7] to make it useful > as an HSM kernel driver. > > So we have at least three options for an HSM kernel driver (FUSE, > fanotify, overlayfs), but none of them is still fully equipped to drive > a modern HSM implementation. > > What is clear is that: > 1. The fast path must not context switch to userspace > 2. The slow path needs an API for calling into user to migrate files/dirs > > What is not clear is: > 1. The method to persistently mark files/dirs for fast/slow path > 2. The API to call into userspace Agreed. > Overlayfs provides a method to mark files for slow path > ('trusted.overlay.metacopy' xattr), meaning file that has metadata > but not the data, but overlayfs does not provide the API to perform > "user controlled migration" of the data. > > Instead of inventing a new API for that, I'd rather extend the > known fanotify protocol and allow the new FAN_XXX_PRE events > only on filesystems that have the concept of a file without its content > (a.k.a. metacopy). > > We could say that filesystems that support fscache can also support > FAN_XXX_PRE events, and perhaps cachefilesd could make use of > hooks to implement user modules that populate the fscache objects > out of band. One possible approach is that we would make these events explicitely targetted to HSM and generated directly by the filesystem which wants to support HSM. 
So basically when the filesystem finds out it needs the data filled in, it will call something like: fsnotify(inode, FAN_PRE_GIVE_ME_DATA, perhaps_some_details_here) Something like what we currently do for filesystem error events, but in this case the event will work like a permission event. Userspace can be watching the filesystem with a superblock mark to receive these events. The persistent marking of files is completely left up to the filesystem in this case - it has to decide when the FAN_PRE_GIVE_ME_DATA event needs to be generated for an inode. > There is the naive approach to interpret a "punched hole" in a file as > "no content" as DMAPI did, to support FAN_XXX_PRE events on > standard local filesystem (fscache does that internally). > That would be an opt-in via fanotify_init() flag and could be useful for > old DMAPI HSM implementations that are converted to use the new API. I'd prefer to leave these details up to the filesystem wanting to support HSM and not clutter the fanotify API with details about file layout. > Practically, the filesystems that allow FAN_XXX_PRE events > on punched files would need to advertise this support and maintain > an inode flag (i.e. I_NODATA) to avoid a performance penalty > on every file access. If we take that route, though, it might be better > off to let the HSM daemon set this flag explicitly (e.g. chattr +X) > when punching holes in files and removing the flag explicitly > when filling the holes. Again, in what I propose this would be left up to the filesystem - e.g. it can have an inode flag or xattr or something else to carry the information that this file is under HSM and call fsnotify() when the file is accessed. It might be challenging to fulfill your desire to generate the event outside of any filesystem locks with this design though.
> And there is the most flexible option of attaching a BPF filter to > a filesystem mark, but I am afraid that this program will be limited > to using information already in the path/dentry/inode struct. > At least HSM could use an existing arbitrary inode flag > (e.g. chattr+i) as "persistent marks". > > So many options! I don't know which to choose :) > > If this plan sounds reasonable, I can start with a POC of > "user controlled copy up/down" for overlayfs, using fanotify > as the user notification protocol and see where it goes from there. Yeah, that might be interesting to see as an example. Another example of "kind-of-hsm" that we already have in the kernel is autofs. So we can think whether that could be implemented using the scheme we design as an exercise. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-09-14 10:30 ` Jan Kara @ 2022-09-14 11:52 ` Amir Goldstein 2022-09-20 18:19 ` Amir Goldstein 0 siblings, 1 reply; 43+ messages in thread From: Amir Goldstein @ 2022-09-14 11:52 UTC (permalink / raw) To: Jan Kara; +Cc: Miklos Szeredi, Plaster, Robert, David Howells, linux-fsdevel > > > > So I'd prefer to avoid the major API > > > > extension unless there are serious users out there - perhaps we will even > > > > need to develop the kernel API in cooperation with the userspace part to > > > > verify the result is actually usable and useful. > > > > Yap. It should be trivial to implement a "mirror" HSM backend. > > For example, the libprojfs [5] project implements a MirrorProvider > > backend for the Microsoft ProjFS [6] HSM API. > > Well, validating that things work using some simple backend is one thing > but we are probably also interested in whether the result is practical to > use - i.e., whether the performance meets the needs, whether the API is not > cumbersome for what HSM solutions need to do, whether the more advanced > features like range-support are useful the way they are implemented etc. > We can verify some of these things with a simple mirror HSM backend but I'm > afraid some of the problems may become apparent only once someone actually > uses the result in practice and for that we need a userspace counterpart > that actually does something useful so that people have motivation to use > it :). Makes sense. > > Overlayfs provides a method to mark files for slow path > > ('trusted.overlay.metacopy' xattr), meaning a file that has metadata > > but not the data, but overlayfs does not provide the API to perform > > "user controlled migration" of the data. > > > > Instead of inventing a new API for that, I'd rather extend the > > known fanotify protocol and allow the new FAN_XXX_PRE events > > only on filesystems that have the concept of a file without its content > > (a.k.a. metacopy). 
> > > > We could say that filesystems that support fscache can also support > > FAN_XXX_PRE events, and perhaps cachefilesd could make use of > > hooks to implement user modules that populate the fscache objects > > out of band. > > One possible approach is that we would make these events explicitly > targeted at HSM and generated directly by the filesystem which wants to > support HSM. So basically when the filesystem finds out it needs the data > filled in, it will call something like: > > fsnotify(inode, FAN_PRE_GIVE_ME_DATA, perhaps_some_details_here) > > Something like what we currently do for filesystem error events but in this > case the event will work like a permission event. Userspace can be watching > the filesystem with a superblock mark to receive these events. The persistent > marking of files is completely left up to the filesystem in this case - it > has to decide when the FAN_PRE_GIVE_ME_DATA event needs to be generated for > an inode. > Whoa! That's DMAPI pain all over again. I do not want to go there. I have no capacity nor desire to drive this sort of change through every single filesystem and convince the maintainers to support those hooks forever. (It seems that you have experienced this pain yourself.) I would like to be able to have a generic vfs implementation. I do not want to rely on the filesystem at all to decide when local data is valid or not for the HSM. That should be completely up to the HSM implementation to decide. My example with a punched file was just to demonstrate what an implementation that is a DMAPI drop-in replacement would look like. It might be ok for specific filesystems, e.g. overlayfs, to call fsnotify(inode, FAN_PRE_XXX,...) e.g. before copy up. That would be done like the remote notifications we discussed for fuse/cifs. I still need to think about how remote and local notifications are distinguished, but at least a bit in the event mask for "remote" should be enough for the eBPF filter to make the decision. 
> > There is the naive approach to interpret a "punched hole" in a file as > > "no content" as DMAPI did, to support FAN_XXX_PRE events on > > a standard local filesystem (fscache does that internally). > > That would be an opt-in via fanotify_init() flag and could be useful for > > old DMAPI HSM implementations that are converted to use the new API. > > I'd prefer to leave these details up to the filesystem wanting to support > HSM and not clutter the fanotify API with details about file layout. > Not cluttering fanotify with details about the filesystem - agreed. Cluttering the filesystem with details about HSM - disagree. A BPF filter on a fanotify mark with the details about the specific HSM should resolve both our objections. If one wants to write an HSM that works with a "chattr +X" filter then this HSM will require filesystem support, but there are ways to implement an HSM that would be mostly filesystem agnostic. > > Practically, the filesystems that allow FAN_XXX_PRE events > > on punched files would need to advertise this support and maintain > > an inode flag (i.e. I_NODATA) to avoid a performance penalty > > on every file access. If we take that route, though, it might be better > > off to let the HSM daemon set this flag explicitly (e.g. chattr +X) > > when punching holes in files and removing the flag explicitly > > when filling the holes. > > Again, in what I propose this would be left up to the filesystem - e.g. it > can have an inode flag or xattr or something else to carry the information > that this file is under HSM and call fsnotify() when the file is accessed. > It might be challenging to fulfill your desire to generate the event > outside of any filesystem locks with this design though. > Right. Another reason to dislike fs internal hooks. I think my eBPF suggestion does not suffer from this problem. 
> > And there is the most flexible option of attaching a BPF filter to > > a filesystem mark, but I am afraid that this program will be limited > > to using information already in the path/dentry/inode struct. > > At least HSM could use an existing arbitrary inode flag > > (e.g. chattr+i) as "persistent marks". > > > > So many options! I don't know which to choose :) > > > > If this plan sounds reasonable, I can start with a POC of > > "user controlled copy up/down" for overlayfs, using fanotify > > as the user notification protocol and see where it goes from there. > I am not sure anymore if overlayfs is a good place to start this POC. I think that an easier POC would be to use an example "demo filter" in C to check some inode flag until the eBPF filter is implemented, and to write a demo backend to demonstrate the usefulness of the API extensions. > Yeah, that might be interesting to see as an example. Another example of > "kind-of-hsm" that we already have in the kernel is autofs. So we can think > whether that could be implemented using the scheme we design as an > exercise. > An interesting exercise, but so far fsnotify has not dealt with namespace changes at all; it has been completely restricted to notifying about filesystem changes, so I don't think now would be a good time to change that paradigm. Thanks, Amir. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-09-14 11:52 ` Amir Goldstein @ 2022-09-20 18:19 ` Amir Goldstein 2022-09-22 10:48 ` Jan Kara 0 siblings, 1 reply; 43+ messages in thread From: Amir Goldstein @ 2022-09-20 18:19 UTC (permalink / raw) To: Jan Kara Cc: Miklos Szeredi, Plaster, Robert, David Howells, linux-fsdevel, Fufu Fang On Wed, Sep 14, 2022 at 2:52 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > So I'd prefer to avoid the major API > > > > > extension unless there are serious users out there - perhaps we will even > > > > > need to develop the kernel API in cooperation with the userspace part to > > > > > verify the result is actually usable and useful. > > > > > > Yap. It should be trivial to implement a "mirror" HSM backend. > > > For example, the libprojfs [5] project implements a MirrorProvider > > > backend for the Microsoft ProjFS [6] HSM API. > > > > Well, validating that things work using some simple backend is one thing > > but we are probably also interested in whether the result is practical to > > use - i.e., whether the performance meets the needs, whether the API is not > > cumbersome for what HSM solutions need to do, whether the more advanced > > features like range-support are useful the way they are implemented etc. > > We can verify some of these things with a simple mirror HSM backend but I'm > > afraid some of the problems may become apparent only once someone actually > > uses the result in practice and for that we need a userspace counterpart > > that does actually something useful so that people have motivation to use > > it :). Makes sense. > Hi Jan, I wanted to give an update on the POC that I am working on. I decided to find a FUSE HSM and show how it may be converted to use fanotify HSM hooks. HTTPDirFS is a read-only FUSE filesystem that lazily populates a local cache from a remote HTTP server on first access to every directory and file range. 
Normally, it would be run like this: ./httpdirfs --cache-location /vdf/cache https://cdn.kernel.org/pub/ /mnt/pub/ Content is accessed via the FUSE mount as /mnt/pub/ and FUSE implements passthrough calls to the local cache dir if the cache is already populated. After my conversion patches [1], this download-only HSM can be run like this without mounting FUSE: sudo ./httpdirfs --fanotify --cache-location /vdf/cache https://cdn.kernel.org/pub/ - [1] https://github.com/amir73il/httpdirfs/commits/fanotify_pre_content Browsing the cache directory at /vdf/cache lazily populates the local cache using FAN_ACCESS_PERM readdir hooks and lazily downloads file content using FAN_ACCESS_PERM read hooks. Up to this point, the implementation did not require any kernel changes. However, this type of command does not populate the path components, because lookup does not generate a FAN_ACCESS_PERM event: stat /vdf/cache/data/linux/kernel/firmware/linux-firmware-20220815.tar.gz To bridge that functionality gap, I've implemented the FAN_LOOKUP_PERM event [2] and used it to lazily populate directories in the path ancestry. For now, I stuck with the XXX_PERM convention and did not require FAN_CLASS_PRE_CONTENT, although we probably should. [2] https://github.com/amir73il/linux/commits/fanotify_pre_content Streaming read of large files works as well, but only for sequential read patterns. Unlike the FUSE read calls, the FAN_ACCESS_PERM events do not (yet) carry range info, so my naive implementation downloads one extra data chunk on each FAN_ACCESS_PERM until the cache file is full. This makes it possible to run commands like: tar tvfz /vdf/cache/data/linux/kernel/firmware/linux-firmware-20220815.tar.gz | less without having to wait for the entire 400MB file to download before seeing the first page. This streaming feature is extremely important for modern HSMs that are often used to archive large media files in the cloud. 
For the next steps of the POC, I could do: - Report FAN_ACCESS_PERM range info to implement random read patterns (e.g. unzip -l) - Introduce FAN_MODIFY_PERM, so file content could be downloaded before modifying a read-write HSM cache - Demo conversion of a read-write FUSE HSM implementation (e.g. https://github.com/volga629/davfs2) - Demo HSM with filesystem mark [*] and a hardcoded test filter [*] Note that unlike the case with recursive inotify, this POC HSM implementation is not racy, because of the lookup permission events. A filesystem mark is still needed to avoid pinning all the unpopulated cache tree leaf entries to the inode cache, so that this HSM could work on a very large scale tree, the same as my original use case for implementing the filesystem mark. If what you are looking for is an explanation of why a fanotify HSM would be better than a FUSE HSM implementation then there are several reasons. Performance is at the top of the list. There is this famous USENIX paper [3] about FUSE passthrough performance. It is a bit outdated, but many parts are still relevant - you can ask the Android developers why they decided to work on FUSE-BPF... [3] https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf For me, performance is one of the main concerns, but not the only one, so I am not entirely convinced that a full FUSE-BPF implementation would solve all my problems. When scaling to many millions of passthrough inodes, resource usage starts becoming a limitation of a FUSE passthrough implementation and memory reclaim of a native fs works a lot better than memory reclaim over FUSE over another native fs. When the workload works on the native filesystem, it is also possible to use native fs features (e.g. XFS ioctls). Questions: - What do you think about the direction this POC has taken so far? - Is there anything specific that you would like to see in the POC to be convinced that this API will be useful? Thanks, Amir. 
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-09-20 18:19 ` Amir Goldstein @ 2022-09-22 10:48 ` Jan Kara 2022-09-22 13:03 ` Amir Goldstein 2022-10-28 12:50 ` Amir Goldstein 0 siblings, 2 replies; 43+ messages in thread From: Jan Kara @ 2022-09-22 10:48 UTC (permalink / raw) To: Amir Goldstein Cc: Jan Kara, Miklos Szeredi, Plaster, Robert, David Howells, linux-fsdevel, Fufu Fang On Tue 20-09-22 21:19:25, Amir Goldstein wrote: > On Wed, Sep 14, 2022 at 2:52 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > > > So I'd prefer to avoid the major API > > > > > > extension unless there are serious users out there - perhaps we will even > > > > > > need to develop the kernel API in cooperation with the userspace part to > > > > > > verify the result is actually usable and useful. > > > > > > > > Yap. It should be trivial to implement a "mirror" HSM backend. > > > > For example, the libprojfs [5] projects implements a MirrorProvider > > > > backend for the Microsoft ProjFS [6] HSM API. > > > > > > Well, validating that things work using some simple backend is one thing > > > but we are probably also interested in whether the result is practical to > > > use - i.e., whether the performance meets the needs, whether the API is not > > > cumbersome for what HSM solutions need to do, whether the more advanced > > > features like range-support are useful the way they are implemented etc. > > > We can verify some of these things with simple mirror HSM backend but I'm > > > afraid some of the problems may become apparent only once someone actually > > > uses the result in practice and for that we need a userspace counterpart > > > that does actually something useful so that people have motivation to use > > > it :). > > > > Hi Jan, > > I wanted to give an update on the POC that I am working on. > I decided to find a FUSE HSM and show how it may be converted > to use fanotify HSM hooks. 
> > HTTPDirFS is a read-only FUSE filesystem that lazyly populates a local > cache from a remote http on first access to every directory and file range. > > Normally, it would be run like this: > ./httpdirfs --cache-location /vdf/cache https://cdn.kernel.org/pub/ /mnt/pub/ > > Content is accessed via FUSE mount as /mnt/pub/ and FUSE implements > passthrough calls to the local cache dir if cache is already populated. > > After my conversion patches [1], this download-only HSM can be run like > this without mounting FUSE: > > sudo ./httpdirfs --fanotify --cache-location /vdf/cache > https://cdn.kernel.org/pub/ - > > [1] https://github.com/amir73il/httpdirfs/commits/fanotify_pre_content > > Browsing the cache directory at /vdf/cache, lazyly populates the local cache > using FAN_ACCESS_PERM readdir hooks and lazyly downloads files content > using FAN_ACCESS_PERM read hooks. > > Up to this point, the implementation did not require any kernel changes. > However, this type of command does not populate the path components, > because lookup does not generate FAN_ACCESS_PERM event: > > stat /vdf/cache/data/linux/kernel/firmware/linux-firmware-20220815.tar.gz > > To bridge that functionality gap, I've implemented the FAN_LOOKUP_PERM > event [2] and used it to lazyly populate directories in the path ancestry. > For now, I stuck with the XXX_PERM convention and did not require > FAN_CLASS_PRE_CONTENT, although we probably should. > > [2] https://github.com/amir73il/linux/commits/fanotify_pre_content > > Streaming read of large files works as well, but only for sequential read > patterns. Unlike the FUSE read calls, the FAN_ACCESS_PERM events > do not (yet) carry range info, so my naive implementation downloads > one extra data chunk on each FAN_ACCESS_PERM until the cache file is full. 
> > This makes it possible to run commands like: > > tar tvfz /vdf/cache/data/linux/kernel/firmware/linux-firmware-20220815.tar.gz > | less > > without having to wait for the entire 400MB file to download before > seeing the first page. > > This streaming feature is extremely important for modern HSMs > that are often used to archive large media files in the cloud. Thanks for the update, Amir! I've glanced through the series and so far it looks pretty simple and I'd have only some style / readability nits (but let's resolve those once we have something more complete). When thinking about HSM (and while following your discussion with Dave) I wondered about one thing: When the notifications happen before we take locks, then we are in principle prone to time-of-check-to-time-of-use (TOCTOU) races, aren't we? How are these resolved? For example something like this, with a file of size 16k:
Reader: reads 8k at offset 12k -> notification sent; HSM makes sure 12k-16k is here and 16k-20k is beyond EOF, so nothing to do.
Writer: meanwhile expands the file to 20k.
Now the file contents must not get moved out until the reader is done, in order not to break it. > For the next steps of POC, I could do: > - Report FAN_ACCESS_PERM range info to implement random read > patterns (e.g. unzip -l) > - Introduce FAN_MODIFY_PERM, so file content could be downloaded > before modifying a read-write HSM cache > - Demo conversion of a read-write FUSE HSM implementation > (e.g. https://github.com/volga629/davfs2) > - Demo HSM with filesystem mark [*] and a hardcoded test filter > > [*] Note that unlike the case with recursive inotify, this POC HSM > implementation is not racy, because of the lookup permission events. > A filesystem mark is still needed to avoid pinning all the unpopulated > cache tree leaf entries to inode cache, so that this HSM could work on > a very large scale tree, the same as my original use case for implementing > filesystem mark. Sounds good! 
Just with your concern about pinning - can't you use evictable marks added on lookup for files / dirs you want to track? Maybe it isn't great design for other reasons but it would save you some event filtering... > If what you are looking for is an explanation of why a fanotify HSM would be better > than a FUSE HSM implementation then there are several reasons. > Performance is at the top of the list. There is this famous USENIX paper [3] > about FUSE passthrough performance. > It is a bit outdated, but many parts are still relevant - you can ask > the Android > developers why they decided to work on FUSE-BPF... > > [3] https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf > > For me, performance is one of the main concerns, but not the only one, > so I am not entirely convinced that a full FUSE-BPF implementation would > solve all my problems. > > When scaling to many millions of passthrough inodes, resource usage starts > becoming a limitation of a FUSE passthrough implementation and memory > reclaim of a native fs works a lot better than memory reclaim over FUSE over > another native fs. > > When the workload works on the native filesystem, it is also possible to > use native fs features (e.g. XFS ioctls). OK, understood. Out of curiosity, you've mentioned you'd looked into implementing HSM in overlayfs. What are the issues there? I assume performance is very close to native, so that is likely not an issue, and the resource usage you mention above is likely not that bad either. So I guess it is that you don't want to invent hooks for userspace for moving (parts of) files between offline storage and the local cache? > Questions: > - What do you think about the direction this POC has taken so far? > - Is there anything specific that you would like to see in the POC > to be convinced that this API will be useful? I think your POC is taking a good direction and your discussion with Dave has made me more confident that this is all workable :). 
I liked your idea of the wiki (or whatever form of documentation) that summarizes what we've discussed in this thread. That would actually be pretty nice for future reference. The remaining concern I have is that we should demonstrate the solution is able to scale to millions of inodes (and likely more) because AFAIU those are the sizes current HSM solutions are interested in. I guess this is kind of covered in your last step of POCs though. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-09-22 10:48 ` Jan Kara @ 2022-09-22 13:03 ` Amir Goldstein 2022-09-26 15:27 ` Jan Kara 2022-10-28 12:50 ` Amir Goldstein 1 sibling, 1 reply; 43+ messages in thread From: Amir Goldstein @ 2022-09-22 13:03 UTC (permalink / raw) To: Jan Kara Cc: Miklos Szeredi, Plaster, Robert, David Howells, linux-fsdevel, Fufu Fang On Thu, Sep 22, 2022 at 1:48 PM Jan Kara <jack@suse.cz> wrote: > > On Tue 20-09-22 21:19:25, Amir Goldstein wrote: > > On Wed, Sep 14, 2022 at 2:52 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > > > > > So I'd prefer to avoid the major API > > > > > > > extension unless there are serious users out there - perhaps we will even > > > > > > > need to develop the kernel API in cooperation with the userspace part to > > > > > > > verify the result is actually usable and useful. > > > > > > > > > > Yap. It should be trivial to implement a "mirror" HSM backend. > > > > > For example, the libprojfs [5] projects implements a MirrorProvider > > > > > backend for the Microsoft ProjFS [6] HSM API. > > > > > > > > Well, validating that things work using some simple backend is one thing > > > > but we are probably also interested in whether the result is practical to > > > > use - i.e., whether the performance meets the needs, whether the API is not > > > > cumbersome for what HSM solutions need to do, whether the more advanced > > > > features like range-support are useful the way they are implemented etc. > > > > We can verify some of these things with simple mirror HSM backend but I'm > > > > afraid some of the problems may become apparent only once someone actually > > > > uses the result in practice and for that we need a userspace counterpart > > > > that does actually something useful so that people have motivation to use > > > > it :). > > > > > > > Hi Jan, > > > > I wanted to give an update on the POC that I am working on. 
> > I decided to find a FUSE HSM and show how it may be converted > > to use fanotify HSM hooks. > > > > HTTPDirFS is a read-only FUSE filesystem that lazyly populates a local > > cache from a remote http on first access to every directory and file range. > > > > Normally, it would be run like this: > > ./httpdirfs --cache-location /vdf/cache https://cdn.kernel.org/pub/ /mnt/pub/ > > > > Content is accessed via FUSE mount as /mnt/pub/ and FUSE implements > > passthrough calls to the local cache dir if cache is already populated. > > > > After my conversion patches [1], this download-only HSM can be run like > > this without mounting FUSE: > > > > sudo ./httpdirfs --fanotify --cache-location /vdf/cache > > https://cdn.kernel.org/pub/ - > > > > [1] https://github.com/amir73il/httpdirfs/commits/fanotify_pre_content > > > > Browsing the cache directory at /vdf/cache, lazyly populates the local cache > > using FAN_ACCESS_PERM readdir hooks and lazyly downloads files content > > using FAN_ACCESS_PERM read hooks. > > > > Up to this point, the implementation did not require any kernel changes. > > However, this type of command does not populate the path components, > > because lookup does not generate FAN_ACCESS_PERM event: > > > > stat /vdf/cache/data/linux/kernel/firmware/linux-firmware-20220815.tar.gz > > > > To bridge that functionality gap, I've implemented the FAN_LOOKUP_PERM > > event [2] and used it to lazyly populate directories in the path ancestry. > > For now, I stuck with the XXX_PERM convention and did not require > > FAN_CLASS_PRE_CONTENT, although we probably should. > > > > [2] https://github.com/amir73il/linux/commits/fanotify_pre_content > > > > Streaming read of large files works as well, but only for sequential read > > patterns. Unlike the FUSE read calls, the FAN_ACCESS_PERM events > > do not (yet) carry range info, so my naive implementation downloads > > one extra data chunk on each FAN_ACCESS_PERM until the cache file is full. 
> > > > This makes it possible to run commands like: > > > > tar tvfz /vdf/cache/data/linux/kernel/firmware/linux-firmware-20220815.tar.gz > > | less > > > > without having to wait for the entire 400MB file to download before > > seeing the first page. > > > > This streaming feature is extremely important for modern HSMs > > that are often used to archive large media files in the cloud. > > Thanks for the update, Amir! I've glanced through the series and so far it looks > pretty simple and I'd have only some style / readability nits (but let's > resolve those once we have something more complete). > > When thinking about HSM (and while following your discussion with Dave) I > wondered about one thing: When the notifications happen before we take > locks, then we are in principle prone to time-of-check-to-time-of-use > (TOCTOU) races, aren't we? How are these resolved? > > For example something like this, with a file of size 16k: > Reader: reads 8k at offset 12k -> notification sent; HSM makes sure > 12k-16k is here and 16k-20k is beyond EOF, so nothing to do. > Writer: meanwhile expands the file to 20k. > Now the file contents must not get moved out until the reader is > done, in order not to break it. > Good question. The way I was considering resolving this is that, to evict file content, the HSM would:
1. Try to take a write lease (F_SETLEASE) on the file to make sure no other users are accessing the file
2. Set the mark or persistent NODATA bit on the file
3. Drop the write lease
4. Evict the file content
You can think of it as upgrading from a breakable write lease to an unbreakable write lease. This way, users cannot access or modify file content without either triggering FAN_ACCESS_PERM/FAN_MODIFY_PERM or preventing the HSM from evicting file content until close(). 
And this fix in linux-next: https://lore.kernel.org/linux-fsdevel/20220816145317.710368-1-amir73il@gmail.com/ makes sure that those users cannot open a file without either triggering FAN_OPEN_PERM or preventing the HSM from evicting file content until close(). The HSM signal handler for lease breaks would run rarely and only needs to drop the lease. If the lease is dropped before the mark was successfully set, abort the content eviction. I hope I got this right - this is a design that I have not made a POC of yet, but it is on my TODO. BTW, evicting directory content is not something that I have on my radar at the moment. Not sure how many HSMs do that anyway. FWIW, it was never part of DMAPI. > > For the next steps of POC, I could do: > > - Report FAN_ACCESS_PERM range info to implement random read > > patterns (e.g. unzip -l) > > - Introduce FAN_MODIFY_PERM, so file content could be downloaded > > before modifying a read-write HSM cache > > - Demo conversion of a read-write FUSE HSM implementation > > (e.g. https://github.com/volga629/davfs2) > > - Demo HSM with filesystem mark [*] and a hardcoded test filter > > > > [*] Note that unlike the case with recursive inotify, this POC HSM > > implementation is not racy, because of the lookup permission events. > > A filesystem mark is still needed to avoid pinning all the unpopulated > > cache tree leaf entries to inode cache, so that this HSM could work on > > a very large scale tree, the same as my original use case for implementing > > filesystem mark. > > Sounds good! Just with your concern about pinning - can't you use evictable > marks added on lookup for files / dirs you want to track? Maybe it isn't > great design for other reasons but it would save you some event > filtering... > With the current POC, there is no trigger to re-establish the evicted mark, because the parent is already populated and has no mark. A hook on instantiating an inode into the inode cache could fill that gap. 
It could still be useful to filter FAN_INSTANTIATE_PERM events in the kernel but it is not a must, because instantiation is rarer than (say) lookup, and the fast lookup path (RCU walk) on populated trees suffers almost no overhead when the filesystem is watched. Please think about this and let me know if you think that this is a direction worth pursuing, now, or as a later optimization. > > If what you are looking for is an explanation of why a fanotify HSM would be better > > than a FUSE HSM implementation then there are several reasons. > > Performance is at the top of the list. There is this famous USENIX paper [3] > > about FUSE passthrough performance. > > It is a bit outdated, but many parts are still relevant - you can ask > > the Android > > developers why they decided to work on FUSE-BPF... > > > > [3] https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf > > > > For me, performance is one of the main concerns, but not the only one, > > so I am not entirely convinced that a full FUSE-BPF implementation would > > solve all my problems. > > > > When scaling to many millions of passthrough inodes, resource usage starts > > becoming a limitation of a FUSE passthrough implementation and memory > > reclaim of a native fs works a lot better than memory reclaim over FUSE over > > another native fs. > > > > When the workload works on the native filesystem, it is also possible to > > use native fs features (e.g. XFS ioctls). > > OK, understood. Out of curiosity, you've mentioned you'd looked into > implementing HSM in overlayfs. What are the issues there? I assume > performance is very close to native, so that is likely not an issue, and > the resource usage you mention above is likely not that bad either. So I guess > it is that you don't want to invent hooks for userspace for moving (parts > of) files between offline storage and the local cache? 
> In a nutshell, when realizing that overlayfs needs userspace hooks to cater to HSM, it becomes quite pointless to use a stacked fs design. Performance is not a problem with overlayfs, but like with FUSE, all the inodes/dentries in the system are doubled, memory reclaim of a layered fs becomes an awkward dance, which messes with the special logic of xfs shrinkers, and on top of all this, overlayfs does not proxy all the XFS ioctls either. The fsnotify hooks are a much better design when realizing that the likely() case is to do nothing and incur the least overhead, and the unlikely() case of a user hook is rare. > > Questions: > > - What do you think about the direction this POC has taken so far? > > - Is there anything specific that you would like to see in the POC > > to be convinced that this API will be useful? > > I think your POC is taking a good direction and your discussion with Dave > had made me more confident that this is all workable :). I liked your idea > of the wiki (or whatever form of documentation) that summarizes what we've > discussed in this thread. That would be actually pretty nice for future > reference. Yeah, I need that wiki to organize my thoughts as well ;) > > The remaining concern I have is that we should demonstrate the solution is > able to scale to millions of inodes (and likely more) because AFAIU those > are the sizes current HSM solutions are interested in. I guess this is kind > of covered in your last step of POCs though. > Well, in $WORK we have performance test setups for those workloads, so part of my plan is to convert the in-house FUSE HSM to fanotify and make sure that all those tests do not regress. But that is not code, nor tests, that I can release; I can only report back that the POC works and show the building blocks that I used on some open source code base. 
I plan to do the open source small scale POC first to show the building blocks, so you could imagine the end results, and then take the building blocks for a test drive in the real world.

I've put my eye on davfs2 [1] as the code base for a read-write HSM POC, but maybe I will find an S3 FUSE fs that could work too. I am open to other suggestions.

[1] https://github.com/volga629/davfs2

When DeepSpace Storage releases their product to github, I will be happy to work with them on a POC with their code base and I bet they could arrange a large scale test setup. (hint hint).

Thanks,
Amir.

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-09-22 13:03 ` Amir Goldstein @ 2022-09-26 15:27 ` Jan Kara 2022-09-28 12:29 ` Amir Goldstein 0 siblings, 1 reply; 43+ messages in thread From: Jan Kara @ 2022-09-26 15:27 UTC (permalink / raw) To: Amir Goldstein Cc: Jan Kara, Miklos Szeredi, Plaster, Robert, David Howells, linux-fsdevel, Fufu Fang On Thu 22-09-22 16:03:41, Amir Goldstein wrote: > On Thu, Sep 22, 2022 at 1:48 PM Jan Kara <jack@suse.cz> wrote: > > On Tue 20-09-22 21:19:25, Amir Goldstein wrote: > > > For the next steps of POC, I could do: > > > - Report FAN_ACCESS_PERM range info to implement random read > > > patterns (e.g. unzip -l) > > > - Introduce FAN_MODIFY_PERM, so file content could be downloaded > > > before modifying a read-write HSM cache > > > - Demo conversion of a read-write FUSE HSM implementation > > > (e.g. https://github.com/volga629/davfs2) > > > - Demo HSM with filesystem mark [*] and a hardcoded test filter > > > > > > [*] Note that unlike the case with recursive inotify, this POC HSM > > > implementation is not racy, because of the lookup permission events. > > > A filesystem mark is still needed to avoid pinning all the unpopulated > > > cache tree leaf entries to inode cache, so that this HSM could work on > > > a very large scale tree, the same as my original use case for implementing > > > filesystem mark. > > > > Sounds good! Just with your concern about pinning - can't you use evictable > > marks added on lookup for files / dirs you want to track? Maybe it isn't > > great design for other reasons but it would save you some event > > filtering... > > > > With the current POC, there is no trigger to re-establish the evicted mark, > because the parent is already populated and has no mark. So my original thinking was that you'd place FAN_LOOKUP_PERM mark on top of the directory tree and then you'd add evictable marks to all the subdirs that are looked up from the FAN_LOOKUP_PERM event handler. 
That way I'd imagine you can place evictable marks on all directories that are used in a race-free manner. > A hook on instantiate of inode in inode cache could fill that gap. > It could still be useful to filter FAN_INSTANTIATE_PERM events in the > kernel but it is not a must because instantiate is more rare than (say) lookup > and then the fast lookup path (RCU walk) on populated trees suffers almost > no overhead when the filesystem is watched. > > Please think about this and let me know if you think that this is a direction > worth pursuing, now, or as a later optimization. I think an event on instantiate seems to be depending too much on kernel internals instead of obvious filesystem operations. Also it might be a bit challenging during startup when you don't know what is cached and what not so you cannot rely on instantiate events for placing marks. So I'd leave this for future optimization. > > > If what you are looking for is an explanation why fanotify HSM would be better > > > than a FUSE HSM implementation then there are several reasons. > > > Performance is at the top of the list. There is this famous USENIX paper [3] > > > about FUSE passthrough performance. > > > It is a bit outdated, but many parts are still relevant - you can ask > > > the Android > > > developers why they decided to work on FUSE-BFP... > > > > > > [3] https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf > > > > > > For me, performance is one of the main concerns, but not the only one, > > > so I am not entirely convinced that a full FUSE-BFP implementation would > > > solve all my problems. > > > > > > When scaling to many millions of passthrough inodes, resource usage start > > > becoming a limitation of a FUSE passthrough implementation and memory > > > reclaim of native fs works a lot better than memory reclaim over FUSE over > > > another native fs. 
> > > > > > When the workload works on the native filesystem, it is also possible to > > > use native fs features (e.g. XFS ioctls). > > > > OK, understood. Out of curiosity you've mentioned you'd looked into > > implementing HSM in overlayfs. What are the issues there? I assume > > performance is very close to native one so that is likely not an issue and > > resource usage you mention above likely is not that bad either. So I guess > > it is that you don't want to invent hooks for userspace for moving (parts > > of) files between offline storage and the local cache? > > In a nutshell, when realizing that overlayfs needs userspace hooks > to cater HSM, it becomes quite useless to use a stacked fs design. > > Performance is not a problem with overlayfs, but like with FUSE, > all the inodes/dentries in the system double, memory reclaim > of layered fs becomes an awkward dance, which messes with the > special logic of xfs shrinkers, and on top of all this, overlayfs does > not proxy all the XFS ioctls either. > > The fsnotify hooks are a much better design when realizing that > the likely() case is to do nothing and incur least overhead and > the unlikely() case of user hook is rare. OK, understood. Thanks! > > The remaining concern I have is that we should demonstrate the solution is > > able to scale to millions of inodes (and likely more) because AFAIU that > > are the sizes current HSM solutions are interested in. I guess this is kind > > of covered in your last step of POCs though. > > > > Well, in $WORK we have performance test setups for those workloads, > so part of my plan is to convert the in-house FUSE HSM > to fanotify and make sure that all those tests do not regress. > But that is not code, nor tests that I can release, I can only report back > that the POC works and show the building blocks that I used on > some open source code base. Even this is useful I think. 
> I plan to do the open source small scale POC first to show the building blocks so you could imagine the end results and then take the building blocks for a test drive in the real world.
>
> I've put my eye on davfs2 [1] as the code base for read-write HSM POC, but maybe I will find an S3 FUSE fs that could work too. I am open to other suggestions.
>
> [1] https://github.com/volga629/davfs2
>
> When DeepSpace Storage releases their product to github, I will be happy to work with them on a POC with their code base and I bet they could arrange a large scale test setup. (hint hint).

:-)

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: thoughts about fanotify and HSM 2022-09-26 15:27 ` Jan Kara @ 2022-09-28 12:29 ` Amir Goldstein 2022-09-29 10:01 ` Jan Kara 0 siblings, 1 reply; 43+ messages in thread From: Amir Goldstein @ 2022-09-28 12:29 UTC (permalink / raw) To: Jan Kara Cc: Miklos Szeredi, Plaster, Robert, David Howells, linux-fsdevel, Fufu Fang, Dave Chinner On Mon, Sep 26, 2022 at 6:27 PM Jan Kara <jack@suse.cz> wrote: > > On Thu 22-09-22 16:03:41, Amir Goldstein wrote: > > On Thu, Sep 22, 2022 at 1:48 PM Jan Kara <jack@suse.cz> wrote: > > > On Tue 20-09-22 21:19:25, Amir Goldstein wrote: > > > > For the next steps of POC, I could do: > > > > - Report FAN_ACCESS_PERM range info to implement random read > > > > patterns (e.g. unzip -l) > > > > - Introduce FAN_MODIFY_PERM, so file content could be downloaded > > > > before modifying a read-write HSM cache > > > > - Demo conversion of a read-write FUSE HSM implementation > > > > (e.g. https://github.com/volga629/davfs2) > > > > - Demo HSM with filesystem mark [*] and a hardcoded test filter > > > > > > > > [*] Note that unlike the case with recursive inotify, this POC HSM > > > > implementation is not racy, because of the lookup permission events. > > > > A filesystem mark is still needed to avoid pinning all the unpopulated > > > > cache tree leaf entries to inode cache, so that this HSM could work on > > > > a very large scale tree, the same as my original use case for implementing > > > > filesystem mark. > > > > > > Sounds good! Just with your concern about pinning - can't you use evictable > > > marks added on lookup for files / dirs you want to track? Maybe it isn't > > > great design for other reasons but it would save you some event > > > filtering... > > > > > > > With the current POC, there is no trigger to re-establish the evicted mark, > > because the parent is already populated and has no mark. 
> > So my original thinking was that you'd place FAN_LOOKUP_PERM mark on top of > the directory tree and then you'd add evictable marks to all the subdirs > that are looked up from the FAN_LOOKUP_PERM event handler. That way I'd > imagine you can place evictable marks on all directories that are used in a > race-free manner. > Maybe I am missing something. I don't see how that can scale up to provide penalty free fast path lookup for fully populated subtrees. > > A hook on instantiate of inode in inode cache could fill that gap. > > It could still be useful to filter FAN_INSTANTIATE_PERM events in the > > kernel but it is not a must because instantiate is more rare than (say) lookup > > and then the fast lookup path (RCU walk) on populated trees suffers almost > > no overhead when the filesystem is watched. > > > > Please think about this and let me know if you think that this is a direction > > worth pursuing, now, or as a later optimization. > > I think an event on instantiate seems to be depending too much on kernel > internals instead of obvious filesystem operations. Also it might be a bit > challenging during startup when you don't know what is cached and what not > so you cannot rely on instantiate events for placing marks. So I'd leave > this for future optimization. > Perhaps a user FAN_INSTANTIATE_PERM is too much, but I was trying to figure out a way to make automatic inode marks work. If we can define reasonable use cases for automatic inode marks that kernel can implement (e.g. inheriting from parent on dentry instantiate) then this could possibly get us something useful. Maybe that is what you meant with the suggestion above? The other use case of automatic inode marks I was thinking about, which are even more relevant for $SUBJECT is this: When instantiating a dentry with an inode that has xattr "security.fanotify.mask" (a.k.a. 
persistent inode mark), an inode mark could be auto created and attached to a group with a special sb mark (we can limit a single special mark per sb). This could be implemented similar to get_acl(), where i_fsnotify_mask is always initialized with a special value (i.e. FS_UNINITIALIZED), which is set to either 0 or non-zero depending on whether "security.fanotify.mask" exists.

The details of how such an API would look are very unclear to me, so I will try to focus on the recursive auto inode mark idea.

Thanks,
Amir.
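The get_acl()-style lazy initialization described above can be modeled in a few lines of userspace C; this is only a model of the idea, not kernel code, and all names here (model_inode, FS_MASK_UNINITIALIZED, the mask value 0x5) are made up for illustration:

```c
#include <stdint.h>

#define FS_MASK_UNINITIALIZED 0xffffffffu /* sentinel, like ACL_NOT_CACHED */

struct model_inode {
	uint32_t fsnotify_mask;	/* starts life as the sentinel value */
};

static int xattr_reads;		/* counts simulated getxattr() calls */

/* Stand-in for reading "security.fanotify.mask"; returns 0 if absent. */
static uint32_t read_mark_xattr(struct model_inode *inode)
{
	(void)inode;
	xattr_reads++;
	return 0x5;		/* pretend the xattr stores mask 0x5 */
}

/*
 * The first event on the inode pays for one xattr lookup; after that
 * the result (possibly 0, meaning "no persistent mark") stays cached
 * in the in-core mask, so the fast path never touches the xattr again.
 */
static uint32_t cached_fsnotify_mask(struct model_inode *inode)
{
	if (inode->fsnotify_mask == FS_MASK_UNINITIALIZED)
		inode->fsnotify_mask = read_mark_xattr(inode);
	return inode->fsnotify_mask;
}
```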
* Re: thoughts about fanotify and HSM 2022-09-28 12:29 ` Amir Goldstein @ 2022-09-29 10:01 ` Jan Kara 2022-10-07 13:58 ` Amir Goldstein 0 siblings, 1 reply; 43+ messages in thread From: Jan Kara @ 2022-09-29 10:01 UTC (permalink / raw) To: Amir Goldstein Cc: Jan Kara, Miklos Szeredi, Plaster, Robert, David Howells, linux-fsdevel, Fufu Fang, Dave Chinner On Wed 28-09-22 15:29:13, Amir Goldstein wrote: > On Mon, Sep 26, 2022 at 6:27 PM Jan Kara <jack@suse.cz> wrote: > > > > On Thu 22-09-22 16:03:41, Amir Goldstein wrote: > > > On Thu, Sep 22, 2022 at 1:48 PM Jan Kara <jack@suse.cz> wrote: > > > > On Tue 20-09-22 21:19:25, Amir Goldstein wrote: > > > > > For the next steps of POC, I could do: > > > > > - Report FAN_ACCESS_PERM range info to implement random read > > > > > patterns (e.g. unzip -l) > > > > > - Introduce FAN_MODIFY_PERM, so file content could be downloaded > > > > > before modifying a read-write HSM cache > > > > > - Demo conversion of a read-write FUSE HSM implementation > > > > > (e.g. https://github.com/volga629/davfs2) > > > > > - Demo HSM with filesystem mark [*] and a hardcoded test filter > > > > > > > > > > [*] Note that unlike the case with recursive inotify, this POC HSM > > > > > implementation is not racy, because of the lookup permission events. > > > > > A filesystem mark is still needed to avoid pinning all the unpopulated > > > > > cache tree leaf entries to inode cache, so that this HSM could work on > > > > > a very large scale tree, the same as my original use case for implementing > > > > > filesystem mark. > > > > > > > > Sounds good! Just with your concern about pinning - can't you use evictable > > > > marks added on lookup for files / dirs you want to track? Maybe it isn't > > > > great design for other reasons but it would save you some event > > > > filtering... > > > > > > > > > > With the current POC, there is no trigger to re-establish the evicted mark, > > > because the parent is already populated and has no mark. 
> > > > So my original thinking was that you'd place FAN_LOOKUP_PERM mark on top of > > the directory tree and then you'd add evictable marks to all the subdirs > > that are looked up from the FAN_LOOKUP_PERM event handler. That way I'd > > imagine you can place evictable marks on all directories that are used in a > > race-free manner. > > > > Maybe I am missing something. > I don't see how that can scale up to provide penalty free fast path lookup > for fully populated subtrees. No, you are right that this scheme would have non-trivial overhead in processing the lookup events even when the tree is fully populated. > > > A hook on instantiate of inode in inode cache could fill that gap. > > > It could still be useful to filter FAN_INSTANTIATE_PERM events in the > > > kernel but it is not a must because instantiate is more rare than (say) lookup > > > and then the fast lookup path (RCU walk) on populated trees suffers almost > > > no overhead when the filesystem is watched. > > > > > > Please think about this and let me know if you think that this is a direction > > > worth pursuing, now, or as a later optimization. > > > > I think an event on instantiate seems to be depending too much on kernel > > internals instead of obvious filesystem operations. Also it might be a bit > > challenging during startup when you don't know what is cached and what not > > so you cannot rely on instantiate events for placing marks. So I'd leave > > this for future optimization. > > > > Perhaps a user FAN_INSTANTIATE_PERM is too much, but I was > trying to figure out a way to make automatic inode marks work. > If we can define reasonable use cases for automatic inode marks that > kernel can implement (e.g. inheriting from parent on dentry instantiate) > then this could possibly get us something useful. > Maybe that is what you meant with the suggestion above? 
Well, my suggestion was pondering if we can implement something like automatic inode marks in userspace using the FAN_LOOKUP_PERM event. But you are right that the overhead in the fast path does not make this very attractive. So we'll have to look more into the in-kernel solution.

> The other use case of automatic inode marks I was thinking about, which are even more relevant for $SUBJECT is this: When instantiating a dentry with an inode that has xattr "security.fanotify.mask" (a.k.a. persistent inode mark), an inode mark could be auto created and attached to a group with a special sb mark (we can limit a single special mark per sb). This could be implemented similar to get_acl(), where i_fsnotify_mask is always initialized with a special value (i.e. FS_UNINITIALIZED) which is set to either 0 or non-zero if "security.fanotify.mask" exists.
>
> The details of how such an API would look like are very unclear to me, so I will try to focus on the recursive auto inode mark idea.

Yeah, although initializing fanotify marks based on xattrs does not look completely crazy, I can see a lot of open questions there, so I think the automatic inode mark idea has more chances for success at this point :).

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: thoughts about fanotify and HSM 2022-09-29 10:01 ` Jan Kara @ 2022-10-07 13:58 ` Amir Goldstein 2022-10-12 15:44 ` Jan Kara 0 siblings, 1 reply; 43+ messages in thread From: Amir Goldstein @ 2022-10-07 13:58 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner

[reducing CC]

> > The other use case of automatic inode marks I was thinking about, which are even more relevant for $SUBJECT is this: When instantiating a dentry with an inode that has xattr "security.fanotify.mask" (a.k.a. persistent inode mark), an inode mark could be auto created and attached to a group with a special sb mark (we can limit a single special mark per sb). This could be implemented similar to get_acl(), where i_fsnotify_mask is always initialized with a special value (i.e. FS_UNINITIALIZED) which is set to either 0 or non-zero if "security.fanotify.mask" exists.
> >
> > The details of how such an API would look like are very unclear to me, so I will try to focus on the recursive auto inode mark idea.

> Yeah, although initializing fanotify marks based on xattrs does not look completely crazy I can see a lot of open questions there so I think automatic inode mark idea has more chances for success at this point :).

I realized that there is one sort of "persistent mark" that raises fewer questions - one that only has an ignore mask.

Ignore masks can have a "static" namespace that is not bound to any specific group, but rather a set of groups that join this namespace.

I played with this idea and wrote some patches:
https://github.com/amir73il/linux/commits/fan_xattr_ignore_mask

This may end up being useful in the HSM POC - i.e., HSM places a persistent ignore mask for permission events on populated dirs/files and removes the persistent mask before punching a hole.

Haven't forgotten about the promised wiki. For now, I just wanted to share this idea.

Thanks,
Amir.
* Re: thoughts about fanotify and HSM 2022-10-07 13:58 ` Amir Goldstein @ 2022-10-12 15:44 ` Jan Kara 2022-10-12 16:28 ` Amir Goldstein 0 siblings, 1 reply; 43+ messages in thread From: Jan Kara @ 2022-10-12 15:44 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Dave Chinner Hi Amir! On Fri 07-10-22 16:58:21, Amir Goldstein wrote: > > > The other use case of automatic inode marks I was thinking about, > > > which are even more relevant for $SUBJECT is this: > > > When instantiating a dentry with an inode that has xattr > > > "security.fanotify.mask" (a.k.a. persistent inode mark), an inode > > > mark could be auto created and attached to a group with a special sb > > > mark (we can limit a single special mark per sb). > > > This could be implemented similar to get_acl(), where i_fsnotify_mask > > > is always initialized with a special value (i.e. FS_UNINITIALIZED) > > > which is set to either 0 or non-zero if "security.fanotify.mask" exists. > > > > > > The details of how such an API would look like are very unclear to me, > > > so I will try to focus on the recursive auto inode mark idea. > > > > Yeah, although initializing fanotify marks based on xattrs does not look > > completely crazy I can see a lot of open questions there so I think > > automatic inode mark idea has more chances for success at this point :). > > I realized that there is one sort of "persistent mark" who raises > less questions - one that only has an ignore mask. > > ignore masks can have a "static" namespace that is not bound to any > specific group, but rather a set of groups that join this namespace. > > I played with this idea and wrote some patches: > https://github.com/amir73il/linux/commits/fan_xattr_ignore_mask I have glanced over the patches. 
In general the idea looks OK to me but I have some concerns:

1) Technically, it may be challenging to call into filesystem xattr handling code on first event generated by the inode - that may generate some unexpected lock recursion for some filesystems and some events which trigger the initialization...

2) What if you set the xattr while the group is already listening to events? Currently the change will get ignored, won't it? But I guess this could be handled by clearing the "cached" flag when the xattr is set.

3) What if multiple applications want to use the persistent mark functionality? I think we need some way to associate a particular fanotify group with a particular subset of fanotify xattrs so that coexistence of multiple applications is possible...

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: thoughts about fanotify and HSM 2022-10-12 15:44 ` Jan Kara @ 2022-10-12 16:28 ` Amir Goldstein 2022-10-13 12:16 ` Amir Goldstein 0 siblings, 1 reply; 43+ messages in thread From: Amir Goldstein @ 2022-10-12 16:28 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner On Wed, Oct 12, 2022 at 6:44 PM Jan Kara <jack@suse.cz> wrote: > > Hi Amir! > > On Fri 07-10-22 16:58:21, Amir Goldstein wrote: > > > > The other use case of automatic inode marks I was thinking about, > > > > which are even more relevant for $SUBJECT is this: > > > > When instantiating a dentry with an inode that has xattr > > > > "security.fanotify.mask" (a.k.a. persistent inode mark), an inode > > > > mark could be auto created and attached to a group with a special sb > > > > mark (we can limit a single special mark per sb). > > > > This could be implemented similar to get_acl(), where i_fsnotify_mask > > > > is always initialized with a special value (i.e. FS_UNINITIALIZED) > > > > which is set to either 0 or non-zero if "security.fanotify.mask" exists. > > > > > > > > The details of how such an API would look like are very unclear to me, > > > > so I will try to focus on the recursive auto inode mark idea. > > > > > > Yeah, although initializing fanotify marks based on xattrs does not look > > > completely crazy I can see a lot of open questions there so I think > > > automatic inode mark idea has more chances for success at this point :). > > > > I realized that there is one sort of "persistent mark" who raises > > less questions - one that only has an ignore mask. > > > > ignore masks can have a "static" namespace that is not bound to any > > specific group, but rather a set of groups that join this namespace. > > > > I played with this idea and wrote some patches: > > https://github.com/amir73il/linux/commits/fan_xattr_ignore_mask > > I have glanced over the patches. 
In general the idea looks OK to me but I > have some concerns: > > 1) Technically, it may be challenging to call into filesystem xattr > handling code on first event generated by the inode - that may generate > some unexpected lock recursion for some filesystems and some events which > trigger the initialization... That may be a correct statement in general, but please note that - Only permission events trigger auto-init of xattr ignore mask - Permission events are called from security hooks - Security hooks may also call getxattr to get the security context Perhaps LSMs always initialize the security context in OPEN and never in ACCESS? One of the earlier versions of the patch initialized xattr ignore mask on LOOKUP permission event, if ANY object was interested in ANY permission event even if no object was interested in LOOKUP to mimic the LSM context initialization, but it was complicated and I wasn't sure if this was necessary. Maybe that is the way to go... > > 2) What if you set the xattr while the group is already listening to > events? Currently the change will get ignored, won't it? But I guess this > could be handled by clearing the "cached" flag when the xattr is set. > I have created an API to update the xattr via fanotify_mark(FAN_MARK_XATTR, ... which updates the cached ignore mask in the connector. I see no reason to support "direct" modifications of this xattr. If such changes are made directly it is fine to ignore them. > 3) What if multiple applications want to use the persistent mark > functionality? I think we need some way to associate a particular > fanotify group with a particular subset of fanotify xattrs so that > coexistence of multiple applications is possible... > Yeh, I thought about this as well. The API in the patches is quite naive because it implements a single global namespace for xattr ignore masks, but mostly I wanted to see if I can get the fast path and auto-init implementation done. 
I was generally thinking of ioctl() as the API to join an xattr marks namespace and negotiate the on-disk format of persistent marks supported by the application.

I would not want to allow multiple fanotify xattrs per inode - that could have the consequence of the inode becoming a junkyard.

I'd prefer to have a single xattr (say security.fanotify.mark) and that mark will have:
- on-disk format version
- namespace id
- ignore mask
- etc.

If multiple applications want to use persistent marks, they need to figure out how to work together without stepping on each other's toes. I don't think it is up to fanotify to coordinate that.

fanotify_mark() can fail with EEXIST when a group that joined namespace A tries to update a persistent mark while a persistent mark of namespace B already exists, and probably some FAN_MARK_REPLACE flag could be used to force overwrite the existing persistent mark.

Do you agree?

Thanks,
Amir.
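The single-xattr value listed above (on-disk format version, namespace id, ignore mask) could be encoded along these lines; the struct name, field sizes and the version-1 check are illustrative assumptions, not an agreed on-disk format:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical value layout for one "security.fanotify.mark" xattr. */
struct fan_xattr_mark {
	uint32_t version;	/* on-disk format version */
	uint32_t nsid;		/* id of the xattr marks namespace */
	uint64_t ignore_mask;	/* events to ignore for this inode */
};

/* Serialize a mark into an xattr value buffer; 0 means buffer too small. */
static size_t fan_xattr_pack(const struct fan_xattr_mark *mark,
			     unsigned char *buf, size_t size)
{
	if (size < sizeof(*mark))
		return 0;
	memcpy(buf, mark, sizeof(*mark));
	return sizeof(*mark);
}

/* Parse an xattr value; reject short values and unknown versions. */
static int fan_xattr_unpack(struct fan_xattr_mark *mark,
			    const unsigned char *buf, size_t size)
{
	if (size < sizeof(*mark))
		return -1;
	memcpy(mark, buf, sizeof(*mark));
	return mark->version == 1 ? 0 : -1;
}
```

A version field up front is what lets the negotiate-the-format step mentioned above reject marks written by an incompatible application.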
* Re: thoughts about fanotify and HSM 2022-10-12 16:28 ` Amir Goldstein @ 2022-10-13 12:16 ` Amir Goldstein 2022-11-03 12:57 ` Jan Kara 0 siblings, 1 reply; 43+ messages in thread From: Amir Goldstein @ 2022-10-13 12:16 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner On Wed, Oct 12, 2022 at 7:28 PM Amir Goldstein <amir73il@gmail.com> wrote: > > On Wed, Oct 12, 2022 at 6:44 PM Jan Kara <jack@suse.cz> wrote: > > > > Hi Amir! > > > > On Fri 07-10-22 16:58:21, Amir Goldstein wrote: > > > > > The other use case of automatic inode marks I was thinking about, > > > > > which are even more relevant for $SUBJECT is this: > > > > > When instantiating a dentry with an inode that has xattr > > > > > "security.fanotify.mask" (a.k.a. persistent inode mark), an inode > > > > > mark could be auto created and attached to a group with a special sb > > > > > mark (we can limit a single special mark per sb). > > > > > This could be implemented similar to get_acl(), where i_fsnotify_mask > > > > > is always initialized with a special value (i.e. FS_UNINITIALIZED) > > > > > which is set to either 0 or non-zero if "security.fanotify.mask" exists. > > > > > > > > > > The details of how such an API would look like are very unclear to me, > > > > > so I will try to focus on the recursive auto inode mark idea. > > > > > > > > Yeah, although initializing fanotify marks based on xattrs does not look > > > > completely crazy I can see a lot of open questions there so I think > > > > automatic inode mark idea has more chances for success at this point :). > > > > > > I realized that there is one sort of "persistent mark" who raises > > > less questions - one that only has an ignore mask. > > > > > > ignore masks can have a "static" namespace that is not bound to any > > > specific group, but rather a set of groups that join this namespace. 
> > > > > > I played with this idea and wrote some patches: > > > https://github.com/amir73il/linux/commits/fan_xattr_ignore_mask > > > > I have glanced over the patches. In general the idea looks OK to me but I > > have some concerns: > > > > 1) Technically, it may be challenging to call into filesystem xattr > > handling code on first event generated by the inode - that may generate > > some unexpected lock recursion for some filesystems and some events which > > trigger the initialization... > > That may be a correct statement in general, but please note that > - Only permission events trigger auto-init of xattr ignore mask > - Permission events are called from security hooks > - Security hooks may also call getxattr to get the security context > > Perhaps LSMs always initialize the security context in OPEN and > never in ACCESS? > > One of the earlier versions of the patch initialized xattr ignore mask > on LOOKUP permission event, if ANY object was interested in ANY > permission event even if no object was interested in LOOKUP > to mimic the LSM context initialization, > but it was complicated and I wasn't sure if this was necessary. > Also, permission events can sleep by definition so why would getxattr not be safe in the context of permission events handling? > > > > > 2) What if you set the xattr while the group is already listening to > > events? Currently the change will get ignored, won't it? But I guess this > > could be handled by clearing the "cached" flag when the xattr is set. > > > > I have created an API to update the xattr via > fanotify_mark(FAN_MARK_XATTR, ... > which updates the cached ignore mask in the connector. > > I see no reason to support "direct" modifications of this xattr. > If such changes are made directly it is fine to ignore them. > > > 3) What if multiple applications want to use the persistent mark > > functionality? 
I think we need some way to associate a particular > > fanotify group with a particular subset of fanotify xattrs so that > > coexistence of multiple applications is possible... > > > > Yeh, I thought about this as well. > The API in the patches is quite naive because it implements a single > global namespace for xattr ignore masks, but mostly I wanted to > see if I can get the fast path and auto-init implementation done. > > I was generally thinking of ioctl() as the API to join an xattr marks > namespace and negotiate the on-disk format of persistent marks > supported by the application. > > I would not want to allow multiple fanotify xattrs per inode - > that could have the consequence of the inode becoming a junkyard. > > I'd prefer to have a single xattr (say security.fanotify.mark) > and that mark will have > - on-disk format version > - namespace id > - ignore mask > - etc > > If multiple applications want to use persistent marks they need to figure > out how to work together without stepping on each other's toes. > I don't think it is up to fanotify to coordinate that. > > fanotify_mark() can fail with EEXIST when a group that joined namespace A > is trying to update a persistent mark when a persistent mark of namespace B > already exists and probably some FAN_MARK_REPLACE flag could be used > to force overwrite the existing persistent mark. > One thing that I feel a bit silly about is something that I only fully noticed after writing this WIP wiki entry: https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#persistent-inode-marks Persistent marks (in xattr) with ignore mask are useful, but only a little bit more useful than Evictable marks with ignore mask. Persistent marks (in xattr) with a "positive" event mask are the real deal. 
Because with "positive" persistent marks, we will be able to optimize away srcu_read_lock() and marks iteration for the majority of fsnotify hooks by looking at the objects' interest masks (including the FS_XATTR_CACHED bit).

The good thing about the POC patches [1] is that there is no technical limitation in this code for using persistent marks with positive event masks. It is a bit more challenging to document the fact that a "normal" fs/mount mark is needed in order to "activate" the persistent marks in the inodes of this fs/mount, but the changes to the code to support that would be minimal.

Thanks,
Amir.

[1] https://github.com/amir73il/linux/commits/fan_xattr_ignore_mask
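The fast-path optimization described here boils down to a single mask test before srcu_read_lock(); the sketch below models it, where FS_XATTR_CACHED is the bit proposed in this thread and both constants are illustrative values, not existing kernel UAPI:

```c
#include <stdbool.h>
#include <stdint.h>

#define FS_OPEN_PERM	0x00010000u /* illustrative event bit */
#define FS_XATTR_CACHED	0x80000000u /* proposed "xattr state known" bit */

/*
 * Modeled on the start of fsnotify(): the hook may return before
 * srcu_read_lock() and mark iteration when the persistent mark has
 * already been loaded (FS_XATTR_CACHED set) and no interest bit
 * overlaps the event.
 */
static bool needs_slow_path(uint32_t interest_mask, uint32_t event_mask)
{
	if (!(interest_mask & FS_XATTR_CACHED))
		return true;	/* must load the persistent mark first */
	return (interest_mask & event_mask) != 0;
}
```

With "positive" persistent marks the common case becomes the second branch returning false, which is the whole point of the optimization.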
* Re: thoughts about fanotify and HSM 2022-10-13 12:16 ` Amir Goldstein @ 2022-11-03 12:57 ` Jan Kara 2022-11-03 13:38 ` Amir Goldstein 0 siblings, 1 reply; 43+ messages in thread From: Jan Kara @ 2022-11-03 12:57 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Dave Chinner I'm sorry for the really delayed response. We had an internal conference and some stuff around that which made me busy. On Thu 13-10-22 15:16:25, Amir Goldstein wrote: > On Wed, Oct 12, 2022 at 7:28 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > On Wed, Oct 12, 2022 at 6:44 PM Jan Kara <jack@suse.cz> wrote: > > > > > > Hi Amir! > > > > > > On Fri 07-10-22 16:58:21, Amir Goldstein wrote: > > > > > > The other use case of automatic inode marks I was thinking about, > > > > > > which are even more relevant for $SUBJECT is this: > > > > > > When instantiating a dentry with an inode that has xattr > > > > > > "security.fanotify.mask" (a.k.a. persistent inode mark), an inode > > > > > > mark could be auto created and attached to a group with a special sb > > > > > > mark (we can limit a single special mark per sb). > > > > > > This could be implemented similar to get_acl(), where i_fsnotify_mask > > > > > > is always initialized with a special value (i.e. FS_UNINITIALIZED) > > > > > > which is set to either 0 or non-zero if "security.fanotify.mask" exists. > > > > > > > > > > > > The details of how such an API would look like are very unclear to me, > > > > > > so I will try to focus on the recursive auto inode mark idea. > > > > > > > > > > Yeah, although initializing fanotify marks based on xattrs does not look > > > > > completely crazy I can see a lot of open questions there so I think > > > > > automatic inode mark idea has more chances for success at this point :). > > > > > > > > I realized that there is one sort of "persistent mark" who raises > > > > less questions - one that only has an ignore mask. 
> > > > > > > > ignore masks can have a "static" namespace that is not bound to any > > > > specific group, but rather a set of groups that join this namespace. > > > > > > > > I played with this idea and wrote some patches: > > > > https://github.com/amir73il/linux/commits/fan_xattr_ignore_mask > > > > > > I have glanced over the patches. In general the idea looks OK to me but I > > > have some concerns: > > > > > > 1) Technically, it may be challenging to call into filesystem xattr > > > handling code on first event generated by the inode - that may generate > > > some unexpected lock recursion for some filesystems and some events which > > > trigger the initialization... > > > > That may be a correct statement in general, but please note that > > - Only permission events trigger auto-init of xattr ignore mask > > - Permission events are called from security hooks > > - Security hooks may also call getxattr to get the security context > > > > Perhaps LSMs always initialize the security context in OPEN and > > never in ACCESS? > > > > One of the earlier versions of the patch initialized xattr ignore mask > > on LOOKUP permission event, if ANY object was interested in ANY > > permission event even if no object was interested in LOOKUP > > to mimic the LSM context initialization, > > but it was complicated and I wasn't sure if this was necessary. > > > > Also, permission events can sleep by definition > so why would getxattr not be safe in the > context of permission events handling? Well, I'm not afraid of sleeping. I was more worried about lock ordering issues. But you are right that this probably isn't going to be an issue. > > > 2) What if you set the xattr while the group is already listening to > > > events? Currently the change will get ignored, won't it? But I guess this > > > could be handled by clearing the "cached" flag when the xattr is set. > > > > > > > I have created an API to update the xattr via > > fanotify_mark(FAN_MARK_XATTR, ...
> > which updates the cached ignore mask in the connector. > > > > I see no reason to support "direct" modifications of this xattr. > > If such changes are made directly it is fine to ignore them. > > > > > 3) What if multiple applications want to use the persistent mark > > > functionality? I think we need some way to associate a particular > > > fanotify group with a particular subset of fanotify xattrs so that > > > coexistence of multiple applications is possible... > > > > > > > Yeh, I thought about this as well. > > The API in the patches is quite naive because it implements a single > > global namespace for xattr ignore masks, but mostly I wanted to > > see if I can get the fast path and auto-init implementation done. > > > > I was generally thinking of ioctl() as the API to join an xattr marks > > namespace and negotiate the on-disk format of persistent marks > > supported by the application. > > > > I would not want to allow multiple fanotify xattrs per inode - > > that could have the consequence of the inode becoming a junkyard. > > > > I'd prefer to have a single xattr (say security.fanotify.mark) > > and that mark will have > > - on-disk format version > > - namespace id > > - ignore mask > > - etc > > > > If multiple applications want to use persistent marks they need to figure > > out how to work together without stepping on each other's toes. > > I don't think it is up to fanotify to coordinate that. I'm not sure if this really scales. Imagine you have your say backup application that wants to use persistent marks and then you have your HSM application wanting to do the same. Or you have some daemon caching preparsed contents of config/ directory and watching whether it needs to rescan the dir and rebuild the cache using persistent marks (I'm hearing requests like these for persistent marks from desktop people for many years). How exactly are these going to coordinate? 
I fully understand your concern about the clutter in inode xattrs but if we're going to limit the kernel to support only one persistent marks user, then IMO we also need to provide a userspace counterpart (in the form of some service or library like persistent change notification journal) that will handle the coordination. Because otherwise it will become a mess rather quickly. > > fanotify_mark() can fail with EEXIST when a group that joined namespace A > > is trying to update a persistent mark when a persistent mark of namespace B > > already exists and probably some FAN_MARK_REPLACE flag could be used > > to force overwrite the existing persistent mark. > > One thing that I feel a bit silly about is something that I only fully > noticed after writing this WIP wiki entry: > https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#persistent-inode-marks > > Persistent marks (in xattr) with ignore mask are useful, but only a > little bit more useful than Evictable marks with ignore mask. > > Persistent marks (in xattr) with a "positive" event mask are the real deal. > Because with "positive" persistent marks, we will be able to optimize away > srcu_read_lock() and marks iteration for the majority of fsnotify hooks > by looking at objects interest masks (including the FS_XATTR_CACHED bit). > > The good thing about the POC patches [1] is that there is no technical > limitation in this code for using persistent marks with positive event > masks. It is a bit more challenging to document the fact that a "normal" > fs/mount mark is needed in order to "activate" the persistent marks in > the inodes of this fs/mount, but the changes to the code to support that > would be minimal. I agree persistent positive marks are very interesting (in fact I've heard requests for functionality like this about 15 years ago :)). But if you'd like to use them e.g. 
for backup or HSM then you need to somehow make sure you didn't miss any events before you created the activation mark? That looks like a bit unpleasant race with possible data (backup) corruption consequences? > [1] https://github.com/amir73il/linux/commits/fan_xattr_ignore_mask Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-11-03 12:57 ` Jan Kara @ 2022-11-03 13:38 ` Amir Goldstein 0 siblings, 0 replies; 43+ messages in thread From: Amir Goldstein @ 2022-11-03 13:38 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner On Thu, Nov 3, 2022 at 2:57 PM Jan Kara <jack@suse.cz> wrote: > > I'm sorry for the really delayed response. We had an internal conference > and some stuff around that which made me busy. > No problem at all. There is a lot to process in this thread. My follow up email about avoiding TOCTOU is worse... I am happy for whatever feedback you can provide me when you have the time. > On Thu 13-10-22 15:16:25, Amir Goldstein wrote: > > On Wed, Oct 12, 2022 at 7:28 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > On Wed, Oct 12, 2022 at 6:44 PM Jan Kara <jack@suse.cz> wrote: > > > > > > > > Hi Amir! > > > > > > > > On Fri 07-10-22 16:58:21, Amir Goldstein wrote: > > > > > > > The other use case of automatic inode marks I was thinking about, > > > > > > > which are even more relevant for $SUBJECT is this: > > > > > > > When instantiating a dentry with an inode that has xattr > > > > > > > "security.fanotify.mask" (a.k.a. persistent inode mark), an inode > > > > > > > mark could be auto created and attached to a group with a special sb > > > > > > > mark (we can limit a single special mark per sb). > > > > > > > This could be implemented similar to get_acl(), where i_fsnotify_mask > > > > > > > is always initialized with a special value (i.e. FS_UNINITIALIZED) > > > > > > > which is set to either 0 or non-zero if "security.fanotify.mask" exists. > > > > > > > > > > > > > > The details of how such an API would look like are very unclear to me, > > > > > > > so I will try to focus on the recursive auto inode mark idea. 
> > > > > > > > > > > > Yeah, although initializing fanotify marks based on xattrs does not look > > > > > > completely crazy I can see a lot of open questions there so I think > > > > > > automatic inode mark idea has more chances for success at this point :). > > > > > > > > > > I realized that there is one sort of "persistent mark" who raises > > > > > less questions - one that only has an ignore mask. > > > > > > > > > > ignore masks can have a "static" namespace that is not bound to any > > > > > specific group, but rather a set of groups that join this namespace. > > > > > > > > > > I played with this idea and wrote some patches: > > > > > https://github.com/amir73il/linux/commits/fan_xattr_ignore_mask > > > > > > > > I have glanced over the patches. In general the idea looks OK to me but I > > > > have some concerns: > > > > > > > > 1) Technically, it may be challenging to call into filesystem xattr > > > > handling code on first event generated by the inode - that may generate > > > > some unexpected lock recursion for some filesystems and some events which > > > > trigger the initialization... > > > > > > That may be a correct statement in general, but please note that > > > - Only permission events trigger auto-init of xattr ignore mask > > > - Permission events are called from security hooks > > > - Security hooks may also call getxattr to get the security context > > > > > > Perhaps LSMs always initialize the security context in OPEN and > > > never in ACCESS? > > > > > > One of the earlier versions of the patch initialized xattr ignore mask > > > on LOOKUP permission event, if ANY object was interested in ANY > > > permission event even if no object was interested in LOOKUP > > > to mimic the LSM context initialization, > > > but it was complicated and I wasn't sure if this was necessary. > > > > > > > Also, permission events can sleep by definition > > so why would getxattr not be safe in the > > context of permission events handling? 
> > Well, I'm not afraid of sleeping. I was more worried about lock ordering > issues. But are right that this probably isn't going to be an issue. > > > > > 2) What if you set the xattr while the group is already listening to > > > > events? Currently the change will get ignored, won't it? But I guess this > > > > could be handled by clearing the "cached" flag when the xattr is set. > > > > > > > > > > I have created an API to update the xattr via > > > fanotify_mark(FAN_MARK_XATTR, ... > > > which updates the cached ignore mask in the connector. > > > > > > I see no reason to support "direct" modifications of this xattr. > > > If such changes are made directly it is fine to ignore them. > > > > > > > 3) What if multiple applications want to use the persistent mark > > > > functionality? I think we need some way to associate a particular > > > > fanotify group with a particular subset of fanotify xattrs so that > > > > coexistence of multiple applications is possible... > > > > > > > > > > Yeh, I thought about this as well. > > > The API in the patches is quite naive because it implements a single > > > global namespace for xattr ignore masks, but mostly I wanted to > > > see if I can get the fast path and auto-init implementation done. > > > > > > I was generally thinking of ioctl() as the API to join an xattr marks > > > namespace and negotiate the on-disk format of persistent marks > > > supported by the application. > > > > > > I would not want to allow multiple fanotify xattrs per inode - > > > that could have the consequence of the inode becoming a junkyard. > > > > > > I'd prefer to have a single xattr (say security.fanotify.mark) > > > and that mark will have > > > - on-disk format version > > > - namespace id > > > - ignore mask > > > - etc > > > > > > If multiple applications want to use persistent marks they need to figure > > > out how to work together without stepping on each other's toes. > > > I don't think it is up to fanotify to coordinate that. 
> > I'm not sure if this really scales. Imagine you have your say backup > application that wants to use persistent marks and then you have your HSM > application wanting to do the same. Or you have some daemon caching > preparsed contents of config/ directory and watching whether it needs to > rescan the dir and rebuild the cache using persistent marks (I'm hearing > requests like these for persistent marks from desktop people for many > years). How exactly are these going to coordinate? > > I fully understand your concern about the clutter in inode xattrs but if > we're going to limit the kernel to support only one persistent marks user, > then IMO we also need to provide a userspace counterpart (in the form of > some service or library like persistent change notification journal) that > will handle the coordination. Because otherwise it will become a mess > rather quickly. > The concept of singleton userspace services exists, but getting their UAPI right is a challenge. We can draw inspiration from Windows, which is decades ahead of Linux w.r.t. persistent fs notifications: https://learn.microsoft.com/en-us/windows/win32/cfapi/build-a-cloud-file-sync-engine IIUC, the UAPI allows a single CloudSync engine to register per fs (or subtree root) to get the callbacks from a file with the persistent marks (called reparse points) in the "Windows.Storage.Provider" namespace. IOW, you may have the OneDrive content provider or the GoogleDrive content provider serving the persistent marks on a specific directory, never both. But Windows does support different namespaces for reparse points, so it's an example for both sides of our argument. > > > fanotify_mark() can fail with EEXIST when a group that joined namespace A > > > is trying to update a persistent mark when a persistent mark of namespace B > > > already exists and probably some FAN_MARK_REPLACE flag could be used > > > to force overwrite the existing persistent mark.
> > > > One thing that I feel a bit silly about is something that I only fully > > noticed after writing this WIP wiki entry: > > https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#persistent-inode-marks > > > > Persistent marks (in xattr) with ignore mask are useful, but only a > > little bit more useful than Evictable marks with ignore mask. > > > > Persistent marks (in xattr) with a "positive" event mask are the real deal. > > Because with "positive" persistent marks, we will be able to optimize away > > srcu_read_lock() and marks iteration for the majority of fsnotify hooks > > by looking at objects interest masks (including the FS_XATTR_CACHED bit). > > > > The good thing about the POC patches [1] is that there is no technical > > limitation in this code for using persistent marks with positive event > > masks. It is a bit more challenging to document the fact that a "normal" > > fs/mount mark is needed in order to "activate" the persistent marks in > > the inodes of this fs/mount, but the changes to the code to support that > > would be minimal. > > I agree persistent positive marks are very interesting (in fact I've heard > requests for functionality like this about 15 years ago :)). But if you'd > like to use them e.g. for backup or HSM then you need to somehow make sure > you didn't miss any events before you created the activation mark? That > looks like a bit unpleasant race with possible data (backup) corruption > consequences? > Yeh, positive persistent marks is quite futuristic at this point. My POC will stick to mount mark and inode ignore marks, where these races are rather easy to avoid using FAN_MARK_SYNC (see next email). Anyway, I have made a decision to aim for an initial HSM UAPI implementation with evictable ignore marks (without persistent marks) because evictable marks are functionally sufficient and persistent marks have too many UAPI open questions. 
I've modified the POC code to first try persistent ignore marks and fall back to evictable ignore marks, which incurs a few extra userspace callbacks after a service restart or after dropping caches. Thanks, Amir.
* Re: thoughts about fanotify and HSM 2022-09-22 10:48 ` Jan Kara 2022-09-22 13:03 ` Amir Goldstein @ 2022-10-28 12:50 ` Amir Goldstein 2022-11-03 16:30 ` Jan Kara 1 sibling, 1 reply; 43+ messages in thread From: Amir Goldstein @ 2022-10-28 12:50 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner On Thu, Sep 22, 2022 at 1:48 PM Jan Kara <jack@suse.cz> wrote: > > On Tue 20-09-22 21:19:25, Amir Goldstein wrote: [...] > > Hi Jan, > > > > I wanted to give an update on the POC that I am working on. > > I decided to find a FUSE HSM and show how it may be converted > > to use fanotify HSM hooks. > > > > HTTPDirFS is a read-only FUSE filesystem that lazyly populates a local > > cache from a remote http on first access to every directory and file range. > > > > Normally, it would be run like this: > > ./httpdirfs --cache-location /vdf/cache https://cdn.kernel.org/pub/ /mnt/pub/ > > > > Content is accessed via FUSE mount as /mnt/pub/ and FUSE implements > > passthrough calls to the local cache dir if cache is already populated. > > > > After my conversion patches [1], this download-only HSM can be run like > > this without mounting FUSE: > > > > sudo ./httpdirfs --fanotify --cache-location /vdf/cache > > https://cdn.kernel.org/pub/ - > > > > [1] https://github.com/amir73il/httpdirfs/commits/fanotify_pre_content > > > > Browsing the cache directory at /vdf/cache, lazyly populates the local cache > > using FAN_ACCESS_PERM readdir hooks and lazyly downloads files content > > using FAN_ACCESS_PERM read hooks. > > > > Up to this point, the implementation did not require any kernel changes. > > However, this type of command does not populate the path components, > > because lookup does not generate FAN_ACCESS_PERM event: > > > > stat /vdf/cache/data/linux/kernel/firmware/linux-firmware-20220815.tar.gz > > > > To bridge that functionality gap, I've implemented the FAN_LOOKUP_PERM > > event [2] and used it to lazyly populate directories in the path ancestry. 
> > For now, I stuck with the XXX_PERM convention and did not require
> > FAN_CLASS_PRE_CONTENT, although we probably should.
> >
> > [2] https://github.com/amir73il/linux/commits/fanotify_pre_content
> >
> > Streaming read of large files works as well, but only for sequential read
> > patterns. Unlike the FUSE read calls, the FAN_ACCESS_PERM events
> > do not (yet) carry range info, so my naive implementation downloads
> > one extra data chunk on each FAN_ACCESS_PERM until the cache file is full.
> >
> > This makes it possible to run commands like:
> >
> > tar tvfz /vdf/cache/data/linux/kernel/firmware/linux-firmware-20220815.tar.gz | less
> >
> > without having to wait for the entire 400MB file to download before
> > seeing the first page.
> >
> > This streaming feature is extremely important for modern HSMs
> > that are often used to archive large media files in the cloud.
>
> Thanks for the update, Amir! I've glanced through the series and so far it looks
> pretty simple and I'd have only some style / readability nits (but let's
> resolve those once we have something more complete).
>
> When thinking about HSM (and while following your discussion with Dave) I
> wondered about one thing: When the notifications happen before we take
> locks, then we are in principle prone to time-of-check-to-time-of-use
> (TOCTOU) races, aren't we? How are these resolved?
>
> For example something like:
> We have file with size 16k.
>
>   Reader:                             Writer:
>   read 8k at offset 12k
>    -> notification sent
>    - HSM makes sure 12-16k is here
>      and 16-20k is beyond eof so
>      nothing to do
>                                      expand file to 20k
>    - now the file contents must not
>      get moved out until the reader
>      is done in order not to break it

Hi Jan, It's been a while since I updated this topic. I have been making progress on the wiki and POC, but it's not done yet. I would like to pick your brain about my proposed solutions for the TOCTOU race issues, because the solution is subtle and you may have better ideas to suggest.
> > > Questions: > > - What do you think about the direction this POC has taken so far? > > - Is there anything specific that you would like to see in the POC > > to be convinced that this API will be useful? > > I think your POC is taking a good direction and your discussion with Dave > had made me more confident that this is all workable :). I liked your idea > of the wiki (or whatever form of documentation) that summarizes what we've > discussed in this thread. That would be actually pretty nice for future > reference. > The current state of POC is that "populate of access" of both files and directories is working and "race free evict of file content" is also implemented (safely AFAIK). The technique involving exclusive write lease is discussed at [1]. In a nutshell, populate and evict synchronize on atomic i_writecount and this technique can be implemented with upstream UAPIs. I did use persistent xattr marks for the POC, but this is not a must. Evictable inode marks would have worked just as well. Now I have started to work on persistent change tracking. For this, I have only kernel code, only lightly tested, but I did not prove yet that the technique is working. The idea that I started to sketch at [2] is to alternate between two groups. When a change is recorded, an evictable ignore mark will be added on the object. To start recording changes from a new point in time (checkpoint), a new group will be created (with no ignore marks) and the old group will be closed. The core of the algorithm is the "safe handover" between groups. This requires two infrastructure additions. 
The first is FAN_MARK_SYNC [3], as described in the commit message:

---
Synchronous add of mark, or remove/flush of marks with ignore mask,
provides a method for safe handover of event handling between two groups:
- First, group A subscribes to some events with FAN_MARK_SYNC
- Then, group B unsubscribes from those events

This method guarantees that any event that both groups subscribed to
will be delivered to either group or to both of them.

Note that FAN_MARK_SYNC provides no synchronization of the object
interest masks, which are checked outside the srcu read side.
Therefore, this method does not provide any guarantee regarding
delivery of events which only one of the groups is subscribed to.
For example, if only group B was subscribed to FAN_OPEN_EXEC and
only group A is subscribing to FAN_OPEN, an execution of a binary
file may deliver neither FAN_OPEN_EXEC to group B nor FAN_OPEN to
group A.
---

The second is to overlap the fsnotify_mark_srcu read side with
sb_start_write() for pre modify permission events [4], as described
in the commit message:

---
fsnotify: acquire sb write access inside pre modify permission event

For pre modify permission events, acquire sb write access before
leaving SRCU and return >0 to signal that sb write access was acquired.

This can be used to implement safe "handover" of pre modify permission
events between two fanotify groups:
- First, group A subscribes to pre modify events with FAN_MARK_SYNC
- Then, a freeze/thaw cycle is performed on the filesystem
- Finally, group B unsubscribes from those events

This method guarantees that a pre modify event that both groups
subscribed to will be delivered to either group or to both of them.
In case the pre modify event is delivered only to group B, the
freeze/thaw cycle guarantees that the filesystem modification that
followed that pre modify event was also completed before the handover
is complete and group B can be closed.
For the pre rename permission event, acquire sb write access after the
second event of the pair (i.e. rename to) was authorized.
---

What do you think about this handover technique? Do you think that it is
workable, or do you see any major flaws in it? Would you use a different
or an additional synchronization primitive instead of abusing
fsnotify_mark_srcu?

To clarify, the race that I am trying to avoid is:
1. Group B got a pre modify event and recorded the change before time T.
2. The actual modification is performed after time T.
3. Group A does not get a pre modify event, so it does not record the
   change in the checkpoint since T.

Thanks,
Amir.

[1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#invalidating-local-cache
[2] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#tracking-local-modifications
[3] https://github.com/amir73il/linux/commits/fan_mark_sync
[4] https://github.com/amir73il/linux/commits/fan_modify_perm
* Re: thoughts about fanotify and HSM 2022-10-28 12:50 ` Amir Goldstein @ 2022-11-03 16:30 ` Jan Kara 2022-11-04 8:17 ` Amir Goldstein 0 siblings, 1 reply; 43+ messages in thread From: Jan Kara @ 2022-11-03 16:30 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Dave Chinner On Fri 28-10-22 15:50:04, Amir Goldstein wrote: > On Thu, Sep 22, 2022 at 1:48 PM Jan Kara <jack@suse.cz> wrote: > > > > > Questions: > > > - What do you think about the direction this POC has taken so far? > > > - Is there anything specific that you would like to see in the POC > > > to be convinced that this API will be useful? > > > > I think your POC is taking a good direction and your discussion with Dave > > had made me more confident that this is all workable :). I liked your idea > > of the wiki (or whatever form of documentation) that summarizes what we've > > discussed in this thread. That would be actually pretty nice for future > > reference. > > > > The current state of POC is that "populate of access" of both files > and directories is working and "race free evict of file content" is also > implemented (safely AFAIK). > > The technique involving exclusive write lease is discussed at [1]. > In a nutshell, populate and evict synchronize on atomic i_writecount > and this technique can be implemented with upstream UAPIs. Not so much i_writecount AFAIU but the generic lease mechanism overall. But yes, the currently existing APIs should be enough for your purposes. > I did use persistent xattr marks for the POC, but this is not a must. > Evictable inode marks would have worked just as well. OK. > Now I have started to work on persistent change tracking. > For this, I have only kernel code, only lightly tested, but I did not > prove yet that the technique is working. > > The idea that I started to sketch at [2] is to alternate between two groups. > > When a change is recorded, an evictable ignore mark will be added on the > object. 
object. To start recording changes from a new point in time
> (checkpoint), a new group will be created (with no ignore marks) and the
> old group will be closed.

So what I dislike about the scheme with handover between two groups is that it is somewhat complex, and furthermore requiring fs freezing for a checkpoint is going to be rather expensive (and may be problematic if persistent change tracking is used by potentially many unprivileged applications).

As a side note, I think it will be quite useful to be able to request a checkpoint only for a subtree (e.g. some app may be interested only in a particular subtree), and the scheme with two groups will make any optimizations that benefit from that fact more difficult - either we create a new group without ignore marks and then have to re-record changes nobody actually needs, or we have to duplicate ignore marks, which is potentially expensive as well.

Let's think about the race:

> To clarify, the race that I am trying to avoid is:
> 1. group B got a pre modify event and recorded the change before time T
> 2. The actual modification is performed after time T
> 3. group A does not get a pre modify event, so does not record the change
>    in the checkpoint since T

AFAIU you are worried about:

  Task T                        Change journal               App
  write(file)
    generate pre_modify event
                                record 'file' as modified
                                                             Request changes
                                                             Records 'file' contents
    modify 'file' data
                ...
                                                             Request changes
                                                             Nothing changed, but App's
                                                             view of 'file' is obsolete

Can't we solve this by creating a POST_WRITE async event and then using it like:

1) Set state to CHECKPOINT_PENDING.
2) In state CHECKPOINT_PENDING, record all received modify events into a separate 'transition' stream.
3) Remove the ignore marks we need to remove.
4) Switch to the new period & clear CHECKPOINT_PENDING; all events are now recorded to the new period.
5) Merge all events from the 'transition' stream to both the old and new period event streams.
6) Events get removed from the 'transition' stream only once we receive the POST_WRITE event corresponding to the PRE_WRITE event recorded there (or on crash recovery).

This way some events from the 'transition' stream may get merged into multiple period event streams if the checkpoints are frequent and writes take long. This should avoid the above race, should be relatively lightweight, and does not require major API extensions.

BTW, while thinking about this I was wondering: how are applications using a persistent change journal going to deal with buffered vs direct IO? I currently don't see a scheme that would not lose modifications for some combinations...

Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: thoughts about fanotify and HSM 2022-11-03 16:30 ` Jan Kara @ 2022-11-04 8:17 ` Amir Goldstein 2022-11-07 11:10 ` Jan Kara 0 siblings, 1 reply; 43+ messages in thread From: Amir Goldstein @ 2022-11-04 8:17 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner On Thu, Nov 3, 2022 at 6:30 PM Jan Kara <jack@suse.cz> wrote: > > On Fri 28-10-22 15:50:04, Amir Goldstein wrote: > > On Thu, Sep 22, 2022 at 1:48 PM Jan Kara <jack@suse.cz> wrote: > > > > > > > Questions: > > > > - What do you think about the direction this POC has taken so far? > > > > - Is there anything specific that you would like to see in the POC > > > > to be convinced that this API will be useful? > > > > > > I think your POC is taking a good direction and your discussion with Dave > > > had made me more confident that this is all workable :). I liked your idea > > > of the wiki (or whatever form of documentation) that summarizes what we've > > > discussed in this thread. That would be actually pretty nice for future > > > reference. > > > > > > > The current state of POC is that "populate of access" of both files > > and directories is working and "race free evict of file content" is also > > implemented (safely AFAIK). > > > > The technique involving exclusive write lease is discussed at [1]. > > In a nutshell, populate and evict synchronize on atomic i_writecount > > and this technique can be implemented with upstream UAPIs. > > Not so much i_writecount AFAIU but the generic lease mechanism overall. But > yes, the currently existing APIs should be enough for your purposes. > Right. Do note that the write lease is not reliable enough by itself to provide exclusive access to the content, because: 1. The lease break signal is delivered asynchronously to content evict program 2. After the lease break timeout expires, reader will get access to the content even if content eviction is in progress The actual strong exclusive access is provided by the sequence: 1. Open file for write 2. 
Deny future FAN_OPEN_PERM 3. Take a write lease, but just to verify that i_writecount == 1 One odd thing to notice is that in do_dentry_open() the sequence is: 1. increment i_writecount 2. security_file_open() => FAN_OPEN_PERM 3. break_lease() However, FAN_OPEN_PERM is blocking and when the listener reads the event, you get to: create_fd() => ... do_dentry_open(f_mode=FMODE_NONOTIFY): 4. may increment i_writecount 5. security_file_open() => FAN_OPEN_PERM skipped 6. break_lease() => send lease break signal The result is a bit non-intuitive: If a new open is attempted during content evict, the new open will be blocked for the lease timeout, before the listener even gets a chance to respond. But if the lease timeout has expired and the event listener denied the open, the lease break signal will still be delivered to the content evict program, despite the fact that the open is not going to proceed. Nevertheless, as you say, the existing APIs work well for my purpose. > > I did use persistent xattr marks for the POC, but this is not a must. > > Evictable inode marks would have worked just as well. > > OK. > > > Now I have started to work on persistent change tracking. > > For this, I have only kernel code, only lightly tested, but I did not > > prove yet that the technique is working. > > > > The idea that I started to sketch at [2] is to alternate between two groups. > > > > When a change is recorded, an evictable ignore mark will be added on the > > object. To start recording changes from a new point in time > > (checkpoint), a new group will be created (with no ignore marks) and the > > old group will be closed.
> > As a side note I think it will be quite useful to be able to request > checkpoint only for a subtree (e.g. some app may be interested only in a > particular subtree) and the scheme with two groups will make any > optimizations to benefit from such fact more difficult - either we create > new group without ignore marks and then have to re-record changes nobody > actually needs or we have to duplicate ignore marks which is potentially > expensive as well. > For the record, a checkpoint of a subtree is already requested by the git-fsmonitor hook, already implemented by watchman using inotify recursive watches and in-memory change tracking, and soon to be implemented as a git built-in fsmonitor [1]. The known caveats are the need to do a "full crawl" after fsmonitor service start or after an event queue overflow, and the pinning of too many inodes in the cache. Should we implement a systemd-fsmonitor service to serve the git-fsmonitor hooks of all the users in the system, it would solve the scalability issue of recursive inode marks by using fanotify filesystem marks. It is understandable that one would expect the fsmonitor system service checkpoints to align with the local user checkpoints (i.e. git status). But it doesn't have to be this way. Imagine that the system change tracking service takes only nightly persistent checkpoints with fsfreeze, regardless of the users' git fsmonitor queries. The service can continue to use async events and in-memory change tracking to provide accurate subtree change queries, but in case of overflow/restart, instead of falling back to a "full crawl", the queries will fall back to "changes since the last persistent checkpoint". For some of our customers a "full crawl" can take weeks. We have been using this scheme to address the scalability of "full crawl" since the first version of Overlayfs snapshots in 2017.
So our customers incur a maintenance penalty of fsfreeze during idle times (weekly by default) and in return, the worst case time for changed files query was reduced from weeks to minutes. It's all about making the right trade offs ;-) [1] https://lore.kernel.org/all/pull.1352.git.git.1665326258.gitgitgadget@gmail.com/ > Let's think about the race: > > > To clarify, the race that I am trying to avoid is: > > 1. group B got a pre modify event and recorded the change before time T > > 2. The actual modification is performed after time T > > 3. group A does not get a pre modify event, so does not record the change > > in the checkpoint since T > > AFAIU you are worried about: > > Task T Change journal App > > write(file) > generate pre_modify event > record 'file' as modified > Request changes > Records 'file' contents > modify 'file' data > > ... > Request changes > Nothing changed but > App's view of 'file' is obsolete. > > Can't we solve this by creating POST_WRITE async event and then use it like: > I like the idea of using POST_WRITE instead of holding sb_writers. > 1) Set state to CHECKPOINT_PENDING > 2) In state CHECKPOINT_PENDING we record all received modify events into a > separate 'transition' stream. > 3) Remove ignore marks we need to remove. Our customer use cases may have many millions of dirs. I don't think this solution will be scalable, which is why I use the alternating groups to invalidate all the existing ignore marks at once. But I agree that alternating groups should not be a requirement for HSM and that for watching smaller subtrees, your suggestion makes more sense. > 4) Switch to new period & clear CHECKPOINT_PENDING, all events are now > recorded to the new period. > 5) Merge all events from 'transition' stream to both old and new period > event streams. > 6) Events get removed from the 'transition' stream only once we receive > POST_WRITE event corresponding to the PRE_WRITE event recorded there (or > on crash recovery). 
This way some events from 'transition' stream may > get merged to multiple period event streams if the checkpoints are > frequent and writes take long. > > This should avoid the above race, should be relatively lightweight, and > does not require major API extensions. > If I am not mistaken, CHECKPOINT_PENDING vs. alternating groups is an implementation detail for the HSM. PRE_WRITE/POST_WRITE and FAN_MARK_SYNC APIs are needed for both implementations (the single-group scheme needs to flush all ignore marks with FAN_MARK_SYNC). I am going to try to implement the PRE/POST_WRITE events and for the POC I may start with a single group because it may be easier, or I may implement both schemes, we'll see. > BTW, while thinking about this I was wondering: How are the applications > using persistent change journal going to deal with buffered vs direct IO? I > currently don't see a scheme that would not loose modifications for some > combinations... > This question is circling back to the nfsd discussion about when to update i_version and when to update mtime - before the change is observed in core? on disk? both before and after? My only answer to that is a hybrid combination of using async events to track in-core changes and persistent change tracking for on-disk changes. fsfreeze before completion of the checkpoint is optional, but if fsfreeze is performed then it should answer your question about buffered vs direct IO and it makes the questions about in-memory vs. on-disk moot. Or maybe I did not understand the problem with buffered vs direct IO? Can you give an example of a lost modification in a scheme that does use fsfreeze for completion of the checkpoint? Thanks, Amir. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-11-04 8:17 ` Amir Goldstein @ 2022-11-07 11:10 ` Jan Kara 2022-11-07 14:13 ` Amir Goldstein 0 siblings, 1 reply; 43+ messages in thread From: Jan Kara @ 2022-11-07 11:10 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Dave Chinner On Fri 04-11-22 10:17:54, Amir Goldstein wrote: > On Thu, Nov 3, 2022 at 6:30 PM Jan Kara <jack@suse.cz> wrote: > > > > On Fri 28-10-22 15:50:04, Amir Goldstein wrote: > > > On Thu, Sep 22, 2022 at 1:48 PM Jan Kara <jack@suse.cz> wrote: > > > > > > > > > Questions: > > > > > - What do you think about the direction this POC has taken so far? > > > > > - Is there anything specific that you would like to see in the POC > > > > > to be convinced that this API will be useful? > > > > > > > > I think your POC is taking a good direction and your discussion with Dave > > > > had made me more confident that this is all workable :). I liked your idea > > > > of the wiki (or whatever form of documentation) that summarizes what we've > > > > discussed in this thread. That would be actually pretty nice for future > > > > reference. > > > > > > > > > > The current state of POC is that "populate of access" of both files > > > and directories is working and "race free evict of file content" is also > > > implemented (safely AFAIK). > > > > > > The technique involving exclusive write lease is discussed at [1]. > > > In a nutshell, populate and evict synchronize on atomic i_writecount > > > and this technique can be implemented with upstream UAPIs. > > > > Not so much i_writecount AFAIU but the generic lease mechanism overall. But > > yes, the currently existing APIs should be enough for your purposes. > > > > Right. Do note that the write lease is not reliable enough by itself > to provide exclusive access to the content, because: > 1. The lease break signal is delivered asynchronously to content evict > program > 2. 
After the lease break timeout expires, reader will get access > to the content even if content eviction is in progress > > The actual strong exclusive access is provided by the sequence: > 1. Open file for write > 2. Deny future FAN_OPEN_PERM > 3. Take write lease, but just to verify that i_writecount == 1 > > Notice one thing odd is that in do_dentry_open() the sequence is: > 1. increment i_writecount > 2. security_file_open() => FAN_OPEN_PERM > 3. break_lease() > > However, FAN_OPEN_PERM is blocking and when listener > reads the event, you get to: > created_fd() => ... do_dentry_open(f_mode=FMODE_NONOTIFY): > 4. may increment i_writecount > 5. security_file_open() => FAN_OPEN_PERM skipped > 6. break_lease() => send lease break signal > > The result is a bit non intuitive: > > If a new open is attempted during content evict, the new open will be > blocked for the lease timeout, before the listener even gets a chance > to respond. > > But if lease timeout has expired and the event listener denied the open, > the lease break signal will still be delivered to the content evict program, > despite the fact that the open is not going to proceed. I see. I'd just note that allowing FID mode for permission events would solve both these problems, won't it? > > > I did use persistent xattr marks for the POC, but this is not a must. > > > Evictable inode marks would have worked just as well. > > > > OK. > > > > > Now I have started to work on persistent change tracking. > > > For this, I have only kernel code, only lightly tested, but I did not > > > prove yet that the technique is working. > > > > > > The idea that I started to sketch at [2] is to alternate between two groups. > > > > > > When a change is recorded, an evictable ignore mark will be added on the > > > object. To start recording changes from a new point in time > > > (checkpoint), a new group will be created (with no ignore marks) and the > > > old group will be closed. 
> > > > So what I dislike about the scheme with handover between two groups is that > > it is somewhat complex and furthermore requiring fs freezing for checkpoint > > is going to be rather expensive (and may be problematic if persistent > > change tracking is used by potentially many unpriviledged applications). > > > > As a side note I think it will be quite useful to be able to request > > checkpoint only for a subtree (e.g. some app may be interested only in a > > particular subtree) and the scheme with two groups will make any > > optimizations to benefit from such fact more difficult - either we create > > new group without ignore marks and then have to re-record changes nobody > > actually needs or we have to duplicate ignore marks which is potentially > > expensive as well. > > > > For the records, checkpoint of a subtree is already requested by > git-fsmonitor hook and already implemented using inotify recursive > watches and in-memory change tracking by watchman and soon > to be implemented as a git built-in fsmonitor [1]. > > The known caveats are the need to do a "full crawl" after fsmonitor > service start or after event queue overflow and pinning of too many > inodes to cache. > > Should we implement a systemd-fsmonitor service to serve > git-fsmonitor hooks of all the users in the system, it will solve the > scalability issue of recursive inode marks by using fanotify > filesystem marks. > > It is understandable that one would expect the fsmonitor system > service checkpoints to align with the local user checkpoints (i.e. git status). > But this doesn't have to be this way. > > Imagine that the system change tracking service takes only > nightly persistent checkpoints with fsfreeze regardless of the > users git fsmonitor queries. 
> > The service can continue to use async events and in-memory > change tracking to provide accurate subtree change queries, > but in case of overflow/restart, instead of falling back to "full crawl", > the queries will fall back to "changes since last persistent checkpoint". > > For some of our customers "full crawl" can take weeks. > We have been using this scheme to address scalability of "full crawl" > since the first version of Overlayfs snapshots in 2017. > > So our customers incur a maintenance penalty of fsfreeze during > idle times (weekly by default) and in return, the worst case time for > changed files query was reduced from weeks to minutes. > It's all about making the right trade offs ;-) > > [1] https://lore.kernel.org/all/pull.1352.git.git.1665326258.gitgitgadget@gmail.com/ Right. I just suspect that with systemd-fsmonitor, once more applications start using it, it will be more difficult to tune this: e.g. for git-status I can imagine daily snapshots will not be very useful (they will not save that much time) and users would rather see something like hourly snapshots or even more frequent ones, at which point things become somewhat inconvenient and performance heavy. So we don't have to implement this now, but I'd prefer we keep this as a possibility for a future API extension... > > Let's think about the race: > > > > > To clarify, the race that I am trying to avoid is: > > > 1. group B got a pre modify event and recorded the change before time T > > > 2. The actual modification is performed after time T > > > 3. group A does not get a pre modify event, so does not record the change > > > in the checkpoint since T > > > > AFAIU you are worried about: > > > > Task T Change journal App > > > > write(file) > > generate pre_modify event > > record 'file' as modified > > Request changes > > Records 'file' contents > > modify 'file' data > > > > ... > > Request changes > > Nothing changed but > > App's view of 'file' is obsolete.
> > > > Can't we solve this by creating POST_WRITE async event and then use it like: > > > > > > I like the idea of using POST_WRITE instead of holding sb_writers. > > > > > 1) Set state to CHECKPOINT_PENDING > > 2) In state CHECKPOINT_PENDING we record all received modify events into a > > separate 'transition' stream. > > 3) Remove ignore marks we need to remove. > > Our customer use cases may have many millions of dirs. > I don't think this solution will be scalable, which is why I use the > alternating groups to invalidate all the existing ignore marks at once. I see. Well, we could also extend FAN_MARK_FLUSH so that you can just remove all ignore marks from a group, so that you don't have to remove them one-by-one and don't have to switch to a new group. In principle, group teardown does the same. It would allow large-scale as well as small-scale users to use a very similar scheme with a single group for switching periods. > But I agree that alternating groups should not be a requirement for HSM > and that for watching smaller subtrees, your suggestion makes more sense. > > > 4) Switch to new period & clear CHECKPOINT_PENDING, all events are now > > recorded to the new period. > > 5) Merge all events from 'transition' stream to both old and new period > > event streams. > > 6) Events get removed from the 'transition' stream only once we receive > > POST_WRITE event corresponding to the PRE_WRITE event recorded there (or > > on crash recovery). This way some events from 'transition' stream may > > get merged to multiple period event streams if the checkpoints are > > frequent and writes take long. > > > > This should avoid the above race, should be relatively lightweight, and > > does not require major API extensions. > > > > If I am not mistaken, CHECKPOINT_PENDING vs. alternating groups > > is an implementation detail for the HSM.
> > PRE_WRITE/POST_WRITE and FAN_MARK_SYNC APIs are needed > for both the implementations (single group scheme needs to flush all > ignore marks with FAN_MARK_SYNC). So why would FAN_MARK_SYNC be needed for the single-group scheme? From the kernel POV the scheme I have proposed does not require any new API changes besides the POST_WRITE event AFAICT. And possibly a FAN_MARK_FLUSH tweak for more efficient removal of ignore marks. We don't even need the filesystem freezing (modulo the buffered vs direct IO discussion below). > I am going to try to implement the PRE/POST_WRITE events and for > POC I may start with a single group because it may be easier or I may > implement both schemes, we'll see. That would be great. Thank you. Perhaps I'm missing something in my mental model which would make things impractical :) > > BTW, while thinking about this I was wondering: How are the applications > > using persistent change journal going to deal with buffered vs direct IO? I > > currently don't see a scheme that would not loose modifications for some > > combinations... > > This question is circling back to nfsd discussion about when to update > i_version and when to update mtime, before the change is observed > in core? on disk? both before and after? > > My only answer to that is a hybrid combination of using async > events to track in-core changes and persistent change tracking > for on-disk changes. > > fsfreeze before completion of the checkpoint is optional, but if > fsfreeze is performed then it should answer your question about > buffered vs direct IO and it makes the questions about in-memory > vs. on-disk moot. > > Or maybe I did not understand the problem with buffered vs direct IO? > Can you give an example for losing modification that does involve > fsfreeze for completion of the checkpoint? You're right that fsfreeze during the period transition also solves the issue with buffered vs direct IO, and I'm not aware of any other reliable solution.
It just seems like a rather big (and expensive) hammer. Probably fine for certain applications (such as a daily system backup) but not so much for others (such as each of your desktop apps wanting to efficiently check for changes in their config directories on startup). So there's probably a market for both solutions, with and without fsfreeze ;) Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-11-07 11:10 ` Jan Kara @ 2022-11-07 14:13 ` Amir Goldstein 2022-11-14 19:17 ` Jan Kara 0 siblings, 1 reply; 43+ messages in thread From: Amir Goldstein @ 2022-11-07 14:13 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner On Mon, Nov 7, 2022 at 1:10 PM Jan Kara <jack@suse.cz> wrote: > > On Fri 04-11-22 10:17:54, Amir Goldstein wrote: > > On Thu, Nov 3, 2022 at 6:30 PM Jan Kara <jack@suse.cz> wrote: > > > > > > On Fri 28-10-22 15:50:04, Amir Goldstein wrote: > > > > On Thu, Sep 22, 2022 at 1:48 PM Jan Kara <jack@suse.cz> wrote: > > > > > > > > > > > Questions: > > > > > > - What do you think about the direction this POC has taken so far? > > > > > > - Is there anything specific that you would like to see in the POC > > > > > > to be convinced that this API will be useful? > > > > > > > > > > I think your POC is taking a good direction and your discussion with Dave > > > > > had made me more confident that this is all workable :). I liked your idea > > > > > of the wiki (or whatever form of documentation) that summarizes what we've > > > > > discussed in this thread. That would be actually pretty nice for future > > > > > reference. > > > > > > > > > > > > > The current state of POC is that "populate of access" of both files > > > > and directories is working and "race free evict of file content" is also > > > > implemented (safely AFAIK). > > > > > > > > The technique involving exclusive write lease is discussed at [1]. > > > > In a nutshell, populate and evict synchronize on atomic i_writecount > > > > and this technique can be implemented with upstream UAPIs. > > > > > > Not so much i_writecount AFAIU but the generic lease mechanism overall. But > > > yes, the currently existing APIs should be enough for your purposes. > > > > > > > Right. Do note that the write lease is not reliable enough by itself > > to provide exclusive access to the content, because: > > 1. 
The lease break signal is delivered asynchronously to content evict > > program > > 2. After the lease break timeout expires, reader will get access > > to the content even if content eviction is in progress > > > > The actual strong exclusive access is provided by the sequence: > > 1. Open file for write > > 2. Deny future FAN_OPEN_PERM > > 3. Take write lease, but just to verify that i_writecount == 1 > > > > Notice one thing odd is that in do_dentry_open() the sequence is: > > 1. increment i_writecount > > 2. security_file_open() => FAN_OPEN_PERM > > 3. break_lease() > > > > However, FAN_OPEN_PERM is blocking and when listener > > reads the event, you get to: > > created_fd() => ... do_dentry_open(f_mode=FMODE_NONOTIFY): > > 4. may increment i_writecount > > 5. security_file_open() => FAN_OPEN_PERM skipped > > 6. break_lease() => send lease break signal > > > > The result is a bit non intuitive: > > > > If a new open is attempted during content evict, the new open will be > > blocked for the lease timeout, before the listener even gets a chance > > to respond. > > > > But if lease timeout has expired and the event listener denied the open, > > the lease break signal will still be delivered to the content evict program, > > despite the fact that the open is not going to proceed. > > I see. I'd just note that allowing FID mode for permission events would > solve both these problems, won't it? > Yes, it would. event_f_flags = O_PATH also does not have this wrinkle. Originally, I had considered using FIDs in the permission events, but I realized the event->fd is used as a key to the permission response, so I would have had to replace event->fd with some response cookie. Anyway, this oddity is not a problem for me now. [...] > > > Let's think about the race: > > > > > > > To clarify, the race that I am trying to avoid is: > > > > 1. group B got a pre modify event and recorded the change before time T > > > > 2. 
The actual modification is performed after time T > > > > 3. group A does not get a pre modify event, so does not record the change > > > > in the checkpoint since T > > > > > > AFAIU you are worried about: > > > > > > Task T Change journal App > > > > > > write(file) > > > generate pre_modify event > > > record 'file' as modified > > > Request changes > > > Records 'file' contents > > > modify 'file' data > > > > > > ... > > > Request changes > > > Nothing changed but > > > App's view of 'file' is obsolete. > > > > > > Can't we solve this by creating POST_WRITE async event and then use it like: > > > > > > > I like the idea of using POST_WRITE instead of holding sb_writers. > > > > > 1) Set state to CHECKPOINT_PENDING > > > 2) In state CHECKPOINT_PENDING we record all received modify events into a > > > separate 'transition' stream. > > > 3) Remove ignore marks we need to remove. > > > > Our customer use cases may have many millions of dirs. > > I don't think this solution will be scalable, which is why I use the > > alternating groups to invalidate all the existing ignore marks at once. > > I see. Well, we could also extend FAN_MARK_FLUSH so that you can just > remove all ignore marks from a group so that you don't have to remove them > one-by-one and don't have to switch to a new group. In principle group > teardown does the same. It would allow large scale as well as small scale > users use very similar scheme with single group for switching periods. > Maybe so, I need to try it to see if it can scale. But note that in the alternating group scheme we do NOT need to wait for the old group teardown to complete before returning the results of the query: 1. group T+1 subscribes to some events with FAN_MARK_SYNC 2. group T unsubscribes from those events Step 2 can be done in the background. 
The query which returns all the files modified between checkpoints T..T+1 can already return the changes recorded by group T while group T is shutting down and cleaning up all the evictable marks. > > But I agree that alternating groups should not be a requirement for HSM > > and that for watching smaller subtrees, your suggestion makes more sense. > > > > > 4) Switch to new period & clear CHECKPOINT_PENDING, all events are now > > > recorded to the new period. > > > 5) Merge all events from 'transition' stream to both old and new period > > > event streams. > > > 6) Events get removed from the 'transition' stream only once we receive > > > POST_WRITE event corresponding to the PRE_WRITE event recorded there (or > > > on crash recovery). This way some events from 'transition' stream may > > > get merged to multiple period event streams if the checkpoints are > > > frequent and writes take long. > > > > > > This should avoid the above race, should be relatively lightweight, and > > > does not require major API extensions. > > > > > > > If I am not mistaken, CHECKPOINT_PENDING vs. alternating groups > > is an implementation detail for the HSM. > > > > PRE_WRITE/POST_WRITE and FAN_MARK_SYNC APIs are needed > > for both the implementations (single group scheme needs to flush all > > ignore marks with FAN_MARK_SYNC). > > So why would be FAN_MARK_SYNC needed for the single group scheme? From the > kernel POV the scheme I have proposed does not require any new API changes > besides the POST_WRITE event AFAICT. And possibly FAN_MARK_FLUSH tweak for > more efficient removal of ignore marks. We don't even need the filesystem > freezing (modulo the buffered vs direct IO discussion below). > Maybe I'm wrong, but my understanding is that after: 3) Remove ignore marks we need to remove. a PRE_WRITE event may still be in send_to_group() with one of the "removed" ignore marks and be ignored. 
So it is not safe to: 4) Switch to new period & clear CHECKPOINT_PENDING My understanding is that synchronize_srcu(&fsnotify_mark_srcu); is needed as a barrier between 3) and 4) In any case, even if CHECKPOINT_PENDING can work, with or without FAN_MARK_SYNC, to me personally, understanding the proof of correctness of the alternating-groups model is very easy, while proving correctness for the CHECKPOINT_PENDING model is something I have not yet been able to accomplish. > > I am going to try to implement the PRE/POST_WRITE events and for > > POC I may start with a single group because it may be easier or I may > > implement both schemes, we'll see. > > That would be great. Thank you. Perhaps I'm missing something in my mental > model which would make things impractical :) > Me too. I won't know before I try. FYI, at the moment I am considering not allowing independent subscription for PRE/POST_XXX events, but only subscription to XXX events, and a group with class FAN_CLASS_VFS_FILTER will get both PRE/POST_XXX and won't be able to subscribe to XXX events that do not have PRE_XXX events. The rationale is that if a group subscribes to either PRE/POST_XXX, the XXX() operation is not going to be on the fast path anyway. This will make it easier to support more PRE/POST_XXX events without using up all of the remaining 32-bit namespace. Using the high 32 bits of the mask for PRE events and folding them into the object interest mask with the low 32 bits is another thing that I was considering in case you would prefer to allow independent subscription for PRE/POST_XXX events. Thanks, Amir. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-11-07 14:13 ` Amir Goldstein @ 2022-11-14 19:17 ` Jan Kara 2022-11-14 20:08 ` Amir Goldstein 0 siblings, 1 reply; 43+ messages in thread From: Jan Kara @ 2022-11-14 19:17 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Dave Chinner On Mon 07-11-22 16:13:37, Amir Goldstein wrote: > On Mon, Nov 7, 2022 at 1:10 PM Jan Kara <jack@suse.cz> wrote: > > > > Let's think about the race: > > > > > > > > > To clarify, the race that I am trying to avoid is: > > > > > 1. group B got a pre modify event and recorded the change before time T > > > > > 2. The actual modification is performed after time T > > > > > 3. group A does not get a pre modify event, so does not record the change > > > > > in the checkpoint since T > > > > > > > > AFAIU you are worried about: > > > > > > > > Task T Change journal App > > > > > > > > write(file) > > > > generate pre_modify event > > > > record 'file' as modified > > > > Request changes > > > > Records 'file' contents > > > > modify 'file' data > > > > > > > > ... > > > > Request changes > > > > Nothing changed but > > > > App's view of 'file' is obsolete. > > > > > > > > Can't we solve this by creating POST_WRITE async event and then use it like: > > > > > > > > > > I like the idea of using POST_WRITE instead of holding sb_writers. > > > > > > > 1) Set state to CHECKPOINT_PENDING > > > > 2) In state CHECKPOINT_PENDING we record all received modify events into a > > > > separate 'transition' stream. > > > > 3) Remove ignore marks we need to remove. > > > > > > Our customer use cases may have many millions of dirs. > > > I don't think this solution will be scalable, which is why I use the > > > alternating groups to invalidate all the existing ignore marks at once. > > > > I see. Well, we could also extend FAN_MARK_FLUSH so that you can just > > remove all ignore marks from a group so that you don't have to remove them > > one-by-one and don't have to switch to a new group. 
In principle group > > teardown does the same. It would allow large scale as well as small scale > > users use very similar scheme with single group for switching periods. > > > > Maybe so, I need to try it to see if it can scale. > > But note that in the alternating group scheme we do NOT need to wait for > the old group teardown to complete before returning the results of the query: > 1. group T+1 subscribes to some events with FAN_MARK_SYNC > 2. group T unsubscribes from those events > > Step 2 can be done in the background. > The query which returns all the files modified between checkpoints T..T+1 > can already return the changes recorded by group T while group T is > shutting down and cleaning up all the evictable marks. I agree. With my scheme with a single group we first need to remove the ignore marks before we can report events for T and start collecting events for T+1. Total amount of work is going to be the same but latency of reporting is going to be higher. I'd just think that it would not be too bad but it needs to be measured. Also from the scheme you've described the declaration of a checkpoint didn't seem like a latency sensitive operation? > > > But I agree that alternating groups should not be a requirement for HSM > > > and that for watching smaller subtrees, your suggestion makes more sense. > > > > > > > 4) Switch to new period & clear CHECKPOINT_PENDING, all events are now > > > > recorded to the new period. > > > > 5) Merge all events from 'transition' stream to both old and new period > > > > event streams. > > > > 6) Events get removed from the 'transition' stream only once we receive > > > > POST_WRITE event corresponding to the PRE_WRITE event recorded there (or > > > > on crash recovery). This way some events from 'transition' stream may > > > > get merged to multiple period event streams if the checkpoints are > > > > frequent and writes take long. 
> > > > > > > > This should avoid the above race, should be relatively lightweight, and > > > > does not require major API extensions. > > > > > > > > > > If I am not mistaken, CHECKPOINT_PENDING vs. alternating groups > > > is an implementation detail for the HSM. > > > > > > PRE_WRITE/POST_WRITE and FAN_MARK_SYNC APIs are needed > > > for both the implementations (single group scheme needs to flush all > > > ignore marks with FAN_MARK_SYNC). > > > > So why would be FAN_MARK_SYNC needed for the single group scheme? From the > > kernel POV the scheme I have proposed does not require any new API changes > > besides the POST_WRITE event AFAICT. And possibly FAN_MARK_FLUSH tweak for > > more efficient removal of ignore marks. We don't even need the filesystem > > freezing (modulo the buffered vs direct IO discussion below). > > > > Maybe I'm wrong, but my understanding is that after: > 3) Remove ignore marks we need to remove. > a PRE_WRITE event may still be in send_to_group() > with one of the "removed" ignore marks and be ignored. > > So it is not safe to: > 4) Switch to new period & clear CHECKPOINT_PENDING Well, but we'd still get the POST_WRITE event, evaluate this as a write straddling checkpoint and include the file into the set of changed files for checkpoint T+1 or later. So I don't think synchronize_srcu() is needed anywhere? > My understanding is that > synchronize_srcu(&fsnotify_mark_srcu); > is needed as barrier between 3) and 4) > > In any case, even if CHECKPOINT_PENDING can work, > with or without FAN_MARK_SYNC, to me personally, understanding > the proof of correctness of alternating groups model is very easy, > while proving correctness for CHECKPOINT_PENDING model is > something that I was not yet able to accomplish. 
I agree the scheme with CHECKPOINT_PENDING isn't easy to reason about but I
don't find your scheme with two groups simpler ;) Let me try to write down
the rationale for my scheme; I think I can even somewhat simplify it:

Write operation consists of:
	generate PRE_WRITE on F
	modify data of F
	generate POST_WRITE on F

Checkpoint consists of:
	clear ignore marks
	report files for which we received PRE_WRITE or POST_WRITE until this
	moment

And the invariant we try to provide is: if file F was modified during
checkpoint T, then we report F as modified during T or some later
checkpoint. It is a matter of quality of implementation that "some later
checkpoint" isn't too much later than T, but that depends on the frequency
of checkpoints, the length of the notification queue, etc., so it is
difficult to give hard guarantees.

And the argument why the scheme maintains the invariant is that if
POST_WRITE is generated after "clear ignore marks" finishes, it will get
delivered and thus F will get reported as modified in some checkpoint once
the event is processed. If POST_WRITE gets generated before "clear ignore
marks" finishes and F is among the ignored inodes, it means F is already in
the modified set, so it will get reported as part of checkpoint T. Also,
the application will already see the modified data when processing the list
of modified files in checkpoint T.

Simple, and we don't even need PRE_WRITE here. But maybe you wanted to
provide a stronger invariant? Like "you are not able to see modified data
without seeing F as modified"? But what exactly would the guarantee be
here? I feel I'm missing something here but I cannot quite grasp it at this
moment...

> > > I am going to try to implement the PRE/POST_WRITE events and for
> > > POC I may start with a single group because it may be easier or I may
> > > implement both schemes, we'll see.
> >
> > That would be great. Thank you. Perhaps I'm missing something in my mental
> > model which would make things impractical :)
> >
>
> Me too.
I won't know before I try. > > FYI, at the moment I am considering not allowing independent > subscription for PRE/POST_XXX events, but only subscribe to > XXX events and a group with class FAN_CLASS_VFS_FILTER > will get both PRE/POST_XXX and won't be able to subscribe > to XXX events that do not have PRE_XXX events. > > The rationale is that if a group subscribes to either PRE/POST_XXX > XXX() operation is not going to be on the fast path anyway. > > This will make it easier to support more PRE/POST_XXX events > without using up all the remaining 32bits namespace. > > Using the high 32bits of mask for PRE events and folding them > in the object interest mask with the low 32bit is another thing > that I was considering in case you would prefer to allow > independent subscription for PRE/POST_XXX events. So one question: Do you see anything else than POST_WRITE as being useful? For directory operations it seems pointless as they hold i_rwsem exclusively so I don't see useful distinction between PRE and POST event. For OPEN and CLOSE I don't see use either. ACCESS might be the only one where PRE and POST would both be useful for something. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-11-14 19:17 ` Jan Kara @ 2022-11-14 20:08 ` Amir Goldstein 2022-11-15 10:16 ` Jan Kara 0 siblings, 1 reply; 43+ messages in thread From: Amir Goldstein @ 2022-11-14 20:08 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner On Mon, Nov 14, 2022 at 9:17 PM Jan Kara <jack@suse.cz> wrote: > > On Mon 07-11-22 16:13:37, Amir Goldstein wrote: > > On Mon, Nov 7, 2022 at 1:10 PM Jan Kara <jack@suse.cz> wrote: > > > > > Let's think about the race: > > > > > > > > > > > To clarify, the race that I am trying to avoid is: > > > > > > 1. group B got a pre modify event and recorded the change before time T > > > > > > 2. The actual modification is performed after time T > > > > > > 3. group A does not get a pre modify event, so does not record the change > > > > > > in the checkpoint since T > > > > > > > > > > AFAIU you are worried about: > > > > > > > > > > Task T Change journal App > > > > > > > > > > write(file) > > > > > generate pre_modify event > > > > > record 'file' as modified > > > > > Request changes > > > > > Records 'file' contents > > > > > modify 'file' data > > > > > > > > > > ... > > > > > Request changes > > > > > Nothing changed but > > > > > App's view of 'file' is obsolete. > > > > > > > > > > Can't we solve this by creating POST_WRITE async event and then use it like: > > > > > > > > > > > > > I like the idea of using POST_WRITE instead of holding sb_writers. > > > > > > > > > 1) Set state to CHECKPOINT_PENDING > > > > > 2) In state CHECKPOINT_PENDING we record all received modify events into a > > > > > separate 'transition' stream. > > > > > 3) Remove ignore marks we need to remove. > > > > > > > > Our customer use cases may have many millions of dirs. > > > > I don't think this solution will be scalable, which is why I use the > > > > alternating groups to invalidate all the existing ignore marks at once. > > > > > > I see. 
Well, we could also extend FAN_MARK_FLUSH so that you can just > > > remove all ignore marks from a group so that you don't have to remove them > > > one-by-one and don't have to switch to a new group. In principle group > > > teardown does the same. It would allow large scale as well as small scale > > > users use very similar scheme with single group for switching periods. > > > > > > > Maybe so, I need to try it to see if it can scale. > > > > But note that in the alternating group scheme we do NOT need to wait for > > the old group teardown to complete before returning the results of the query: > > 1. group T+1 subscribes to some events with FAN_MARK_SYNC > > 2. group T unsubscribes from those events > > > > Step 2 can be done in the background. > > The query which returns all the files modified between checkpoints T..T+1 > > can already return the changes recorded by group T while group T is > > shutting down and cleaning up all the evictable marks. > > I agree. With my scheme with a single group we first need to remove the > ignore marks before we can report events for T and start collecting events > for T+1. Total amount of work is going to be the same but latency of > reporting is going to be higher. I'd just think that it would not be too bad > but it needs to be measured. Also from the scheme you've described the > declaration of a checkpoint didn't seem like a latency sensitive operation? > I guess not. I will look into it. > > > > But I agree that alternating groups should not be a requirement for HSM > > > > and that for watching smaller subtrees, your suggestion makes more sense. > > > > > > > > > 4) Switch to new period & clear CHECKPOINT_PENDING, all events are now > > > > > recorded to the new period. > > > > > 5) Merge all events from 'transition' stream to both old and new period > > > > > event streams. 
> > > > > 6) Events get removed from the 'transition' stream only once we receive > > > > > POST_WRITE event corresponding to the PRE_WRITE event recorded there (or > > > > > on crash recovery). This way some events from 'transition' stream may > > > > > get merged to multiple period event streams if the checkpoints are > > > > > frequent and writes take long. > > > > > > > > > > This should avoid the above race, should be relatively lightweight, and > > > > > does not require major API extensions. > > > > > > > > > > > > > If I am not mistaken, CHECKPOINT_PENDING vs. alternating groups > > > > is an implementation detail for the HSM. > > > > > > > > PRE_WRITE/POST_WRITE and FAN_MARK_SYNC APIs are needed > > > > for both the implementations (single group scheme needs to flush all > > > > ignore marks with FAN_MARK_SYNC). > > > > > > So why would be FAN_MARK_SYNC needed for the single group scheme? From the > > > kernel POV the scheme I have proposed does not require any new API changes > > > besides the POST_WRITE event AFAICT. And possibly FAN_MARK_FLUSH tweak for > > > more efficient removal of ignore marks. We don't even need the filesystem > > > freezing (modulo the buffered vs direct IO discussion below). > > > > > > > Maybe I'm wrong, but my understanding is that after: > > 3) Remove ignore marks we need to remove. > > a PRE_WRITE event may still be in send_to_group() > > with one of the "removed" ignore marks and be ignored. > > > > So it is not safe to: > > 4) Switch to new period & clear CHECKPOINT_PENDING > > Well, but we'd still get the POST_WRITE event, evaluate this as a write > straddling checkpoint and include the file into the set of changed files > for checkpoint T+1 or later. So I don't think synchronize_srcu() is needed > anywhere? > You are probably right. I will double check. 
> > My understanding is that > > synchronize_srcu(&fsnotify_mark_srcu); > > is needed as barrier between 3) and 4) > > > > In any case, even if CHECKPOINT_PENDING can work, > > with or without FAN_MARK_SYNC, to me personally, understanding > > the proof of correctness of alternating groups model is very easy, > > while proving correctness for CHECKPOINT_PENDING model is > > something that I was not yet able to accomplish. > > I agree the scheme with CHECKPOINT_PENDING isn't easy to reason about but I > don't find your scheme with two groups simpler ;) Let me try to write down > rationale for my scheme, I think I can even somewhat simplify it: > > Write operation consists of: > generate PRE_WRITE on F > modify data of F > generate POST_WRITE on F > > Checkpoint consists of: > clear ignore marks > report files for which we received PRE_WRITE or POST_WRITE until this > moment > > And the invariant we try to provide is: If file F was modified during > checkpoint T, then we report F as modified during T or some later > checkpoint. Where it is matter of quality of implementation that "some > later checkpoint" isn't too much later than T but it depends on the > frequency of checkpoints, the length of notification queue, etc. so it is > difficult to give hard guarantees. > > And the argument why the scheme maintains the invariant is that if > POST_WRITE is generated after "clear ignore marks" finishes, it will get > delivered and thus F will get reported as modified in some checkpoint once > the event is processed. If POST_WRITE gets generated before "clear ignore > marks" finishes and F is among ignored inodes, it means F is already in > modified set so it will get reported as part of checkpoint T. Also > application will already see modified data when processing list of modified > files in checkpoint T. > > Simple and we don't even need PRE_WRITE here. But maybe you wanted to > provide stronger invariant? 
Like "you are not able to see modified data > without seeing F as modified?" But what exactly would be a guarantee here? > I feel I'm missing something here but I cannot quite grasp it at this > moment... > This is the very basic guarantee that the persistent change tracking snapshots need to provide. If a crash happens after modification is done and before modification is recorded, we won't detect the modification after reboot. Maybe "checkpoint" was a bad name to use for this handover between two subsequent change tracking snapshots. > > > > I am going to try to implement the PRE/POST_WRITE events and for > > > > POC I may start with a single group because it may be easier or I may > > > > implement both schemes, we'll see. > > > > > > That would be great. Thank you. Perhaps I'm missing something in my mental > > > model which would make things impractical :) > > > > > > > Me too. I won't know before I try. > > > > FYI, at the moment I am considering not allowing independent > > subscription for PRE/POST_XXX events, but only subscribe to > > XXX events and a group with class FAN_CLASS_VFS_FILTER > > will get both PRE/POST_XXX and won't be able to subscribe > > to XXX events that do not have PRE_XXX events. > > > > The rationale is that if a group subscribes to either PRE/POST_XXX > > XXX() operation is not going to be on the fast path anyway. > > > > This will make it easier to support more PRE/POST_XXX events > > without using up all the remaining 32bits namespace. > > > > Using the high 32bits of mask for PRE events and folding them > > in the object interest mask with the low 32bit is another thing > > that I was considering in case you would prefer to allow > > independent subscription for PRE/POST_XXX events. > > So one question: Do you see anything else than POST_WRITE as being useful? > For directory operations it seems pointless as they hold i_rwsem exclusively > so I don't see useful distinction between PRE and POST event. 
For OPEN and > CLOSE I don't see use either. ACCESS might be the only one where PRE and > POST would both be useful for something. > PRE_ACCESS is used to populate missing data and POST_ACCESS is irrelevant for that. PRE_MODIFY is used for something completely different, it is used for the persistent change tracking and this has to be crash safe, so exclusive i_rwsem has nothing to do with it. PRE_MODIFY is called before i_rwsem and before sb_start_write() so that HSM can record the change intent in the same filesystem that the change is going to happen (this is a journaled record). I feel I may have not explained this correctly. Does this make sense? Thanks, Amir. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-11-14 20:08 ` Amir Goldstein @ 2022-11-15 10:16 ` Jan Kara 2022-11-15 13:08 ` Amir Goldstein 0 siblings, 1 reply; 43+ messages in thread From: Jan Kara @ 2022-11-15 10:16 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Dave Chinner On Mon 14-11-22 22:08:16, Amir Goldstein wrote: > On Mon, Nov 14, 2022 at 9:17 PM Jan Kara <jack@suse.cz> wrote: > > > My understanding is that > > > synchronize_srcu(&fsnotify_mark_srcu); > > > is needed as barrier between 3) and 4) > > > > > > In any case, even if CHECKPOINT_PENDING can work, > > > with or without FAN_MARK_SYNC, to me personally, understanding > > > the proof of correctness of alternating groups model is very easy, > > > while proving correctness for CHECKPOINT_PENDING model is > > > something that I was not yet able to accomplish. > > > > I agree the scheme with CHECKPOINT_PENDING isn't easy to reason about but I > > don't find your scheme with two groups simpler ;) Let me try to write down > > rationale for my scheme, I think I can even somewhat simplify it: > > > > Write operation consists of: > > generate PRE_WRITE on F > > modify data of F > > generate POST_WRITE on F > > > > Checkpoint consists of: > > clear ignore marks > > report files for which we received PRE_WRITE or POST_WRITE until this > > moment > > > > And the invariant we try to provide is: If file F was modified during > > checkpoint T, then we report F as modified during T or some later > > checkpoint. Where it is matter of quality of implementation that "some > > later checkpoint" isn't too much later than T but it depends on the > > frequency of checkpoints, the length of notification queue, etc. so it is > > difficult to give hard guarantees. 
> >
> > And the argument why the scheme maintains the invariant is that if
> > POST_WRITE is generated after "clear ignore marks" finishes, it will get
> > delivered and thus F will get reported as modified in some checkpoint once
> > the event is processed. If POST_WRITE gets generated before "clear ignore
> > marks" finishes and F is among ignored inodes, it means F is already in
> > modified set so it will get reported as part of checkpoint T. Also
> > application will already see modified data when processing list of modified
> > files in checkpoint T.
> >
> > Simple and we don't even need PRE_WRITE here. But maybe you wanted to
> > provide stronger invariant? Like "you are not able to see modified data
> > without seeing F as modified?" But what exactly would be a guarantee here?
> > I feel I'm missing something here but I cannot quite grasp it at this
> > moment...
> >
>
> This is the very basic guarantee that the persistent change tracking snapshots
> need to provide. If a crash happens after modification is done and before
> modification is recorded, we won't detect the modification after reboot.

Right, crash safety was the point I was missing ;) Thanks for reminding me.
And now I also see why you use filesystem freezing - it is a way to make
things crash safe, as otherwise it is difficult to guard against a race:

	generate PRE_WRITE for F
					PRE_WRITE ignored because file is already
					modified
					checkpoint happens -> F reported as modified
					contents of F fetched
	modify data
	transaction commit
	<crash>
					POST_WRITE never seen so change to F is
					never reported

I just think filesystem freezing is too big a hammer for widespread use of
persistent change tracking. Can't we introduce some SRCU lock / unlock into
file_start_write() / file_end_write() and then invoke synchronize_srcu()
during checkpoint after removing ignore marks?
It will be much cheaper as
we don't have to flush all dirty data to disk and also writes can keep
flowing while we wait for outstanding writes straddling the checkpoint to
complete. What do you think?

The checkpoint would then do:
	start gathering changes for both T and T+1
	clear ignore marks
	synchronize_srcu()
	stop gathering changes for T and report them

And in this case we would not need POST_WRITE as an event.

The technical problem I see is how to deal with AIO / io_uring because
SRCU needs to be released in the same context as it is acquired - that
would need to be consulted with Paul McKenney if we can make it work. And
another problem I see is that it might not be great to have this
system-wide as e.g. on networking filesystems or pipes writes can block for
really long.

A final question is how to expose this to userspace, because this
functionality would seem useful outside of the filesystem notification
space, so we probably do not need to tie it to that.

Or we could simplify our life somewhat and acquire SRCU when generating
PRE_WRITE and drop it when generating POST_WRITE. This would keep SRCU
within fsnotify and would mitigate the problems coming from system-wide
SRCU. OTOH it will create problems when PRE_WRITE gets generated and
POST_WRITE would not for some reason. Just brainstorming here, I've not
really decided what's better...

> Maybe "checkpoint" was a bad name to use for this handover between
> two subsequent change tracking snapshots.

I'm getting used to the terminology :) But to me it still seems more
natural to look at the situation as a single stream of events where we
fetch batches of changes at certain moments, rather than looking at it as
certain entities collecting events for different time intervals.

> > > > > I am going to try to implement the PRE/POST_WRITE events and for
> > > > > POC I may start with a single group because it may be easier or I may
> > > > > implement both schemes, we'll see.
> > > > That would be great. Thank you.
Perhaps I'm missing something in my mental > > > > model which would make things impractical :) > > > > > > > > > > Me too. I won't know before I try. > > > > > > FYI, at the moment I am considering not allowing independent > > > subscription for PRE/POST_XXX events, but only subscribe to > > > XXX events and a group with class FAN_CLASS_VFS_FILTER > > > will get both PRE/POST_XXX and won't be able to subscribe > > > to XXX events that do not have PRE_XXX events. > > > > > > The rationale is that if a group subscribes to either PRE/POST_XXX > > > XXX() operation is not going to be on the fast path anyway. > > > > > > This will make it easier to support more PRE/POST_XXX events > > > without using up all the remaining 32bits namespace. > > > > > > Using the high 32bits of mask for PRE events and folding them > > > in the object interest mask with the low 32bit is another thing > > > that I was considering in case you would prefer to allow > > > independent subscription for PRE/POST_XXX events. > > > > So one question: Do you see anything else than POST_WRITE as being useful? > > For directory operations it seems pointless as they hold i_rwsem exclusively > > so I don't see useful distinction between PRE and POST event. For OPEN and > > CLOSE I don't see use either. ACCESS might be the only one where PRE and > > POST would both be useful for something. > > PRE_ACCESS is used to populate missing data and POST_ACCESS > is irrelevant for that. > > PRE_MODIFY is used for something completely different, it is used > for the persistent change tracking and this has to be crash safe, so > exclusive i_rwsem has nothing to do with it. > > PRE_MODIFY is called before i_rwsem and before sb_start_write() > so that HSM can record the change intent in the same filesystem > that the change is going to happen (this is a journaled record). > > I feel I may have not explained this correctly. > Does this make sense? Yes, makes sense. 
Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-11-15 10:16 ` Jan Kara @ 2022-11-15 13:08 ` Amir Goldstein 2022-11-16 10:56 ` Jan Kara 0 siblings, 1 reply; 43+ messages in thread From: Amir Goldstein @ 2022-11-15 13:08 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner On Tue, Nov 15, 2022 at 12:16 PM Jan Kara <jack@suse.cz> wrote: > > On Mon 14-11-22 22:08:16, Amir Goldstein wrote: > > On Mon, Nov 14, 2022 at 9:17 PM Jan Kara <jack@suse.cz> wrote: > > > > My understanding is that > > > > synchronize_srcu(&fsnotify_mark_srcu); > > > > is needed as barrier between 3) and 4) > > > > > > > > In any case, even if CHECKPOINT_PENDING can work, > > > > with or without FAN_MARK_SYNC, to me personally, understanding > > > > the proof of correctness of alternating groups model is very easy, > > > > while proving correctness for CHECKPOINT_PENDING model is > > > > something that I was not yet able to accomplish. > > > > > > I agree the scheme with CHECKPOINT_PENDING isn't easy to reason about but I > > > don't find your scheme with two groups simpler ;) Let me try to write down > > > rationale for my scheme, I think I can even somewhat simplify it: > > > > > > Write operation consists of: > > > generate PRE_WRITE on F > > > modify data of F > > > generate POST_WRITE on F > > > > > > Checkpoint consists of: > > > clear ignore marks > > > report files for which we received PRE_WRITE or POST_WRITE until this > > > moment > > > > > > And the invariant we try to provide is: If file F was modified during > > > checkpoint T, then we report F as modified during T or some later > > > checkpoint. Where it is matter of quality of implementation that "some > > > later checkpoint" isn't too much later than T but it depends on the > > > frequency of checkpoints, the length of notification queue, etc. so it is > > > difficult to give hard guarantees. 
> > > > > > And the argument why the scheme maintains the invariant is that if > > > POST_WRITE is generated after "clear ignore marks" finishes, it will get > > > delivered and thus F will get reported as modified in some checkpoint once > > > the event is processed. If POST_WRITE gets generated before "clear ignore > > > marks" finishes and F is among ignored inodes, it means F is already in > > > modified set so it will get reported as part of checkpoint T. Also > > > application will already see modified data when processing list of modified > > > files in checkpoint T. > > > > > > Simple and we don't even need PRE_WRITE here. But maybe you wanted to > > > provide stronger invariant? Like "you are not able to see modified data > > > without seeing F as modified?" But what exactly would be a guarantee here? > > > I feel I'm missing something here but I cannot quite grasp it at this > > > moment... > > > > > > > This is the very basic guarantee that the persistent change tracking snapshots > > need to provide. If a crash happens after modification is done and before > > modification is recorded, we won't detect the modification after reboot. > > Right, crash safety was the point I was missing ;) Thanks for reminding me. > And now I also see why you use filesystem freezing - it is a way to make > things crash safe as otherwise it is difficult to guard against a race > > generate PRE_WRITE for F > PRE_WRITE ignored because file is already > modified > checkpoint happens -> F reported as modified > contents of F fetched > > modify data > transaction commit > <crash> > POST_WRITE never seen so change to F is > never reported > > I just think filesystem freezing is too big hammer for widespread use of > persistent change tracking. Note that fsfreeze is also needed to flush dirty data after modify data, not only to wait for modify data transaction commit. Otherwise the fetched contents of F will differ from contents of F after reboot even if we did wait for POST_WRITE. 
However, in this case, file contents can be considered corrupted and rsync, for example, will not detect the change either, because mtime does match the previously fetched value. As long as applications write files safely (with rename) fsfreeze is not needed, but for strict change tracking, fsfreeze is needed, so fsfreeze is a policy decision of HSM. > Can't we introduce some SRCU lock / unlock into > file_start_write() / file_end_write() and then invoke synchronize_srcu() > during checkpoint after removing ignore marks? It will be much cheaper as > we don't have to flush all dirty data to disk and also writes can keep > flowing while we wait for outstanding writes straddling checkpoint to > complete. What do you think? Maybe, but this is not enough. Note that my patches [1] are overlapping fsnotify_mark_srcu with file_start_write(), so we would need to overlay fsnotify_mark_srcu with this new "modify SRCU". [1] https://github.com/amir73il/linux/commits/fanotify_pre_content > > The checkpoint would then do: > start gathering changes for both T and T+1 > clear ignore marks > synchronize_srcu() > stop gathering changes for T and report them > > And in this case we would not need POST_WRITE as an event. > Why then give up on the POST_WRITE events idea? Don't you think it could work? Or is it just because you think the generic API would be useful to others? > The technical problem I see is how to deal with AIO / io_uring because > SRCU needs to be released in the same context as it is acquired - that > would need to be consulted with Paul McKenney if we can make it work. And > another problem I see is that it might not be great to have this > system-wide as e.g. on networking filesystems or pipes writes can block for > really long. > > Final question is how to expose this to userspace because this > functionality would seem useful outside of filesystem notification space so > probably do not need to tie it to that. 
>
> Or we could simplify our life somewhat and acquire SRCU when generating
> PRE_WRITE and drop it when generating POST_WRITE. This would keep SRCU
> within fsnotify and would mitigate the problems coming from system-wide
> SRCU. OTOH it will create problems when PRE_WRITE gets generated and
> POST_WRITE would not for some reason. Just brainstorming here, I've not
> really decided what's better...
>

What if the checkpoint only acquired (and released) exclusive sb_writers
without flushing dirty data? Wouldn't that be equivalent to the
synchronize_srcu() you suggested?

> > Maybe "checkpoint" was a bad name to use for this handover between
> > two subsequent change tracking snapshots.
>
> I'm getting used to the terminology :) But to me it still seems more
> natural to look at the situation as a single stream of events where we fetch
> batches of changes at certain moments, rather than looking at it as certain
> entities collecting events for different time intervals.
>

I always used "snapshot take" as terminology. I just now started to use
"checkpoint" for this userspace HSM implementation.

I have no objection to a single stream, nor to a "flush all evictable"
command. I will try to start with this implementation for the POC.

Thanks,
Amir.

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-11-15 13:08 ` Amir Goldstein @ 2022-11-16 10:56 ` Jan Kara 2022-11-16 16:24 ` Amir Goldstein 0 siblings, 1 reply; 43+ messages in thread From: Jan Kara @ 2022-11-16 10:56 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Dave Chinner On Tue 15-11-22 15:08:17, Amir Goldstein wrote: > On Tue, Nov 15, 2022 at 12:16 PM Jan Kara <jack@suse.cz> wrote: > > > > On Mon 14-11-22 22:08:16, Amir Goldstein wrote: > > > On Mon, Nov 14, 2022 at 9:17 PM Jan Kara <jack@suse.cz> wrote: > > > > > My understanding is that > > > > > synchronize_srcu(&fsnotify_mark_srcu); > > > > > is needed as barrier between 3) and 4) > > > > > > > > > > In any case, even if CHECKPOINT_PENDING can work, > > > > > with or without FAN_MARK_SYNC, to me personally, understanding > > > > > the proof of correctness of alternating groups model is very easy, > > > > > while proving correctness for CHECKPOINT_PENDING model is > > > > > something that I was not yet able to accomplish. > > > > > > > > I agree the scheme with CHECKPOINT_PENDING isn't easy to reason about but I > > > > don't find your scheme with two groups simpler ;) Let me try to write down > > > > rationale for my scheme, I think I can even somewhat simplify it: > > > > > > > > Write operation consists of: > > > > generate PRE_WRITE on F > > > > modify data of F > > > > generate POST_WRITE on F > > > > > > > > Checkpoint consists of: > > > > clear ignore marks > > > > report files for which we received PRE_WRITE or POST_WRITE until this > > > > moment > > > > > > > > And the invariant we try to provide is: If file F was modified during > > > > checkpoint T, then we report F as modified during T or some later > > > > checkpoint. Where it is matter of quality of implementation that "some > > > > later checkpoint" isn't too much later than T but it depends on the > > > > frequency of checkpoints, the length of notification queue, etc. 
so it is > > > > difficult to give hard guarantees. > > > > > > > > And the argument why the scheme maintains the invariant is that if > > > > POST_WRITE is generated after "clear ignore marks" finishes, it will get > > > > delivered and thus F will get reported as modified in some checkpoint once > > > > the event is processed. If POST_WRITE gets generated before "clear ignore > > > > marks" finishes and F is among ignored inodes, it means F is already in > > > > modified set so it will get reported as part of checkpoint T. Also > > > > application will already see modified data when processing list of modified > > > > files in checkpoint T. > > > > > > > > Simple and we don't even need PRE_WRITE here. But maybe you wanted to > > > > provide stronger invariant? Like "you are not able to see modified data > > > > without seeing F as modified?" But what exactly would be a guarantee here? > > > > I feel I'm missing something here but I cannot quite grasp it at this > > > > moment... > > > > > > > > > > This is the very basic guarantee that the persistent change tracking snapshots > > > need to provide. If a crash happens after modification is done and before > > > modification is recorded, we won't detect the modification after reboot. > > > > Right, crash safety was the point I was missing ;) Thanks for reminding me. > > And now I also see why you use filesystem freezing - it is a way to make > > things crash safe as otherwise it is difficult to guard against a race > > > > generate PRE_WRITE for F > > PRE_WRITE ignored because file is already > > modified > > checkpoint happens -> F reported as modified > > contents of F fetched > > > > modify data > > transaction commit > > <crash> > > POST_WRITE never seen so change to F is > > never reported > > > > I just think filesystem freezing is too big hammer for widespread use of > > persistent change tracking. 
> > Note that fsfreeze is also needed to flush dirty data after modify data, > not only to wait for modify data transaction commit. > > Otherwise the fetched contents of F will differ from contents of F > after reboot even if we did wait for POST_WRITE. > > However, in this case, file contents can be considered corrupted > and rsync, for example, will not detect the change either, because > mtime does match the previously fetched value. > > As long as applications write files safely (with rename) fsfreeze is not > needed, but for strict change tracking, fsfreeze is needed, so fsfreeze > is a policy decision of HSM. Yeah, I agree. It is a policy decision. > > Can't we introduce some SRCU lock / unlock into > > file_start_write() / file_end_write() and then invoke synchronize_srcu() > > during checkpoint after removing ignore marks? It will be much cheaper as > > we don't have to flush all dirty data to disk and also writes can keep > > flowing while we wait for outstanding writes straddling checkpoint to > > complete. What do you think? > > Maybe, but this is not enough. > Note that my patches [1] are overlapping fsnotify_mark_srcu with > file_start_write(), so we would need to overlay fsnotify_mark_srcu > with this new "modify SRCU". > > [1] https://github.com/amir73il/linux/commits/fanotify_pre_content Yes, I know that and frankly, that is what I find somewhat ugly :) I'd rather have the "modify SRCU" cover the whole region we need - which means including the generation of PRE_MODIFY event. > > The checkpoint would then do: > > start gathering changes for both T and T+1 > > clear ignore marks > > synchronize_srcu() > > stop gathering changes for T and report them > > > > And in this case we would not need POST_WRITE as an event. > > > > Why then give up on the POST_WRITE events idea? > Don't you think it could work? So as we are discussing, the POST_WRITE event is not useful when we want to handle crash safety. 
And if we have some other mechanism (like SRCU) which is able to guarantee crash safety, then what is the benefit of POST_WRITE? I'm not against POST_WRITE, I just don't see much value in it if we have another mechanism to deal with events straddling checkpoint. > > The technical problem I see is how to deal with AIO / io_uring because > > SRCU needs to be released in the same context as it is acquired - that > > would need to be consulted with Paul McKenney if we can make it work. And > > another problem I see is that it might not be great to have this > > system-wide as e.g. on networking filesystems or pipes writes can block for > > really long. > > > > Final question is how to expose this to userspace because this > > functionality would seem useful outside of filesystem notification space so > > probably do not need to tie it to that. > > > > Or we could simplify our life somewhat and acquire SRCU when generating > > PRE_WRITE and drop it when generating POST_WRITE. This would keep SRCU > > within fsnotify and would mitigate the problems coming from system-wide > > SRCU. OTOH it will create problems when PRE_WRITE gets generated and > > POST_WRITE would not for some reason. Just brainstorming here, I've not > > really decided what's better... > > > > What if checkpoint only acquired (and released) exclusive sb_writers without > flushing dirty data. > Wouldn't that be equivalent to the synchronize_srcu() you suggested? In terms of guarantees it would be equivalent. In terms of impact on the system it will be considerably worse. Because SRCU allows new SRCU readers to start while synchronize_srcu() is running - so in our case new writes can freely run while we are waiting for pending writes to complete. So the impact of the synchronize_srcu() on system activity will be practically unnoticeable. If we use sb_writers as you suggest, it will block all write activity until all writes finish, which can be a significant amount of time if you have e.g.
write(1 GB of data) running. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-11-16 10:56 ` Jan Kara @ 2022-11-16 16:24 ` Amir Goldstein 2022-11-17 12:38 ` Amir Goldstein ` (2 more replies) 0 siblings, 3 replies; 43+ messages in thread From: Amir Goldstein @ 2022-11-16 16:24 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner > > > Can't we introduce some SRCU lock / unlock into > > > file_start_write() / file_end_write() and then invoke synchronize_srcu() > > > during checkpoint after removing ignore marks? It will be much cheaper as > > > we don't have to flush all dirty data to disk and also writes can keep > > > flowing while we wait for outstanding writes straddling checkpoint to > > > complete. What do you think? > > > > Maybe, but this is not enough. > > Note that my patches [1] are overlapping fsnotify_mark_srcu with > > file_start_write(), so we would need to overlay fsnotify_mark_srcu > > with this new "modify SRCU". > > > > [1] https://github.com/amir73il/linux/commits/fanotify_pre_content > > Yes, I know that and frankly, that is what I find somewhat ugly :) I'd rather > have the "modify SRCU" cover the whole region we need - which means > including the generation of PRE_MODIFY event. > Yeh, it would be great if we can pull this off. > > > The checkpoint would then do: > > > start gathering changes for both T and T+1 > > > clear ignore marks > > > synchronize_srcu() > > > stop gathering changes for T and report them > > > > > > And in this case we would not need POST_WRITE as an event. > > > > > > > Why then give up on the POST_WRITE events idea? > > Don't you think it could work? > > So as we are discussing, the POST_WRITE event is not useful when we want to > handle crash safety. And if we have some other mechanism (like SRCU) which > is able to guarantee crash safety, then what is the benefit of POST_WRITE? > I'm not against POST_WRITE, I just don't see much value in it if we have > another mechanism to deal with events straddling checkpoint. > Not sure I follow. 
I think that crash safety can be achieved also with PRE/POST_WRITE: - PRE_WRITE records an intent to write in persistent snapshot T and add to in-memory map of in-progress writes of period T - When "checkpoint T" starts, new PRE_WRITES are recorded in both T and T+1 persistent snapshots, but event is added only to in-memory map of in-progress writes of period T+1 - "checkpoint T" ends when all in-progress writes of T are completed The trick with alternating snapshots "handover" is this (perhaps I never explained it and I need to elaborate on the wiki [1]): [1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#Modified_files_query The changed files query results need to include recorded changes in both "finalizing" snapshot T and the new snapshot T+1 that was started in the beginning of the query. Snapshot T MUST NOT be discarded until checkpoint/handover is complete AND the query results that contain changes recorded in T and T+1 snapshots have been consumed. When the consumer ACKs that the query results have been safely stored or acted upon (I called this operation "bless" snapshot T+1) then and only then can snapshot T be discarded. After snapshot T is discarded a new query will start snapshot T+2. A changed files query result includes the id of the last blessed snapshot. I think this is more or less equivalent to the SRCU that you suggested, but all the work is done in userspace at application level. If you see any problem with this scheme or don't understand it please let me know and I will try to explain better. > > > The technical problem I see is how to deal with AIO / io_uring because > > > SRCU needs to be released in the same context as it is acquired - that > > > would need to be consulted with Paul McKenney if we can make it work. And > > > another problem I see is that it might not be great to have this > > > system-wide as e.g. on networking filesystems or pipes writes can block for > > > really long. 
> > > > > > Final question is how to expose this to userspace because this > > > functionality would seem useful outside of filesystem notification space so > > > probably do not need to tie it to that. > > > > > > Or we could simplify our life somewhat and acquire SRCU when generating > > > PRE_WRITE and drop it when generating POST_WRITE. This would keep SRCU > > > within fsnotify and would mitigate the problems coming from system-wide > > > SRCU. OTOH it will create problems when PRE_WRITE gets generated and > > > POST_WRITE would not for some reason. Just branstorming here, I've not > > > really decided what's better... > > > Seems there are several non trivial challenges to surmount with this "userspace modification SRCU" idea. For now, I will stay in my comfort zone and try to make the POC with PRE/POST_WRITE work and write the proof of correctness. I will have no objection at all if you figure out how to solve those issues and guide me to a path for implementing sb_write_srcu. It will make the userspace implementation much simpler, getting rid of the in-progress writes in-memory tracking. > > > > What if checkpoint only acquired (and released) exclusive sb_writers without > > flushing dirty data. > > Wouldn't that be equivalent to the synchronize_srcu() you suggested? > > In terms of guarantees it would be equivalent. In terms of impact on the > system it will be considerably worse. Because SRCU allows new SRCU readers > to start while synchronize_srcu() is running - so in our case new writes > can freely run while we are waiting for pending writes to complete. So > impact of the synchronize_srcu() on system activity will be practically > unnoticeable. If we use sb_writers as you suggest, it will block all write > activity until all writes finish. Which can be significant amount of time > if you have e.g. write(1 GB of data) running. > Of course, it was a silly idea. Thanks, Amir. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-11-16 16:24 ` Amir Goldstein @ 2022-11-17 12:38 ` Amir Goldstein 2022-11-23 10:49 ` Jan Kara 2022-11-21 16:40 ` Amir Goldstein 2022-11-23 10:10 ` Jan Kara 2 siblings, 1 reply; 43+ messages in thread From: Amir Goldstein @ 2022-11-17 12:38 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner > > > > The checkpoint would then do: > > > > start gathering changes for both T and T+1 > > > > clear ignore marks > > > > synchronize_srcu() > > > > stop gathering changes for T and report them > > > > > > > > And in this case we would not need POST_WRITE as an event. > > > > > > > > > > Why then give up on the POST_WRITE events idea? > > > Don't you think it could work? > > > > So as we are discussing, the POST_WRITE event is not useful when we want to > > handle crash safety. And if we have some other mechanism (like SRCU) which > > is able to guarantee crash safety, then what is the benefit of POST_WRITE? > > I'm not against POST_WRITE, I just don't see much value in it if we have > > another mechanism to deal with events straddling checkpoint. > > > > Not sure I follow. 
> > I think that crash safety can be achieved also with PRE/POST_WRITE: > - PRE_WRITE records an intent to write in persistent snapshot T > and add to in-memory map of in-progress writes of period T > - When "checkpoint T" starts, new PRE_WRITES are recorded in both > T and T+1 persistent snapshots, but event is added only to > in-memory map of in-progress writes of period T+1 > - "checkpoint T" ends when all in-progress writes of T are completed > > The trick with alternating snapshots "handover" is this > (perhaps I never explained it and I need to elaborate on the wiki [1]): > > [1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#Modified_files_query > > The changed files query results need to include recorded changes in both > "finalizing" snapshot T and the new snapshot T+1 that was started in > the beginning of the query. > > Snapshot T MUST NOT be discarded until checkpoint/handover > is complete AND the query results that contain changes recorded > in T and T+1 snapshots have been consumed. > > When the consumer ACKs that the query results have been safely stored > or acted upon (I called this operation "bless" snapshot T+1) then and > only then can snapshot T be discarded. > > After snapshot T is discarded a new query will start snapshot T+2. > A changed files query result includes the id of the last blessed snapshot. > > I think this is more or less equivalent to the SRCU that you suggested, > but all the work is done in userspace at application level. > > If you see any problem with this scheme or don't understand it > please let me know and I will try to explain better. > Hmm I guess "crash safety" is not well defined. You and I were talking about "system crash" and indeed, this was my only concern with kernel implementation of overlayfs watch. But with userspace HSM service, how can we guarantee that modifications did not happen while the service is down? I don't really have a good answer for this. 
Thinking out loud, we would somehow need to make the default permission decision "deny" for all modifications, maybe through some mount property (e.g. MOUNT_ATTR_PROT_READ), causing the pre-write hooks to default to EROFS if there is no "vfs filter" mount mark. Then it will be possible to expose a "safe" mount to users, where modifications can never go unnoticed even when the HSM service crashes. Thanks, Amir. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-11-17 12:38 ` Amir Goldstein @ 2022-11-23 10:49 ` Jan Kara 2022-11-23 13:07 ` Amir Goldstein 0 siblings, 1 reply; 43+ messages in thread From: Jan Kara @ 2022-11-23 10:49 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Dave Chinner On Thu 17-11-22 14:38:51, Amir Goldstein wrote: > > > > > The checkpoint would then do: > > > > > start gathering changes for both T and T+1 > > > > > clear ignore marks > > > > > synchronize_srcu() > > > > > stop gathering changes for T and report them > > > > > > > > > > And in this case we would not need POST_WRITE as an event. > > > > > > > > > > > > > Why then give up on the POST_WRITE events idea? > > > > Don't you think it could work? > > > > > > So as we are discussing, the POST_WRITE event is not useful when we want to > > > handle crash safety. And if we have some other mechanism (like SRCU) which > > > is able to guarantee crash safety, then what is the benefit of POST_WRITE? > > > I'm not against POST_WRITE, I just don't see much value in it if we have > > > another mechanism to deal with events straddling checkpoint. > > > > > > > Not sure I follow. 
> > > > I think that crash safety can be achieved also with PRE/POST_WRITE: > > - PRE_WRITE records an intent to write in persistent snapshot T > > and add to in-memory map of in-progress writes of period T > > - When "checkpoint T" starts, new PRE_WRITES are recorded in both > > T and T+1 persistent snapshots, but event is added only to > > in-memory map of in-progress writes of period T+1 > > - "checkpoint T" ends when all in-progress writes of T are completed > > > > The trick with alternating snapshots "handover" is this > > (perhaps I never explained it and I need to elaborate on the wiki [1]): > > > > [1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#Modified_files_query > > > > The changed files query results need to include recorded changes in both > > "finalizing" snapshot T and the new snapshot T+1 that was started in > > the beginning of the query. > > > > Snapshot T MUST NOT be discarded until checkpoint/handover > > is complete AND the query results that contain changes recorded > > in T and T+1 snapshots have been consumed. > > > > When the consumer ACKs that the query results have been safely stored > > or acted upon (I called this operation "bless" snapshot T+1) then and > > only then can snapshot T be discarded. > > > > After snapshot T is discarded a new query will start snapshot T+2. > > A changed files query result includes the id of the last blessed snapshot. > > > > I think this is more or less equivalent to the SRCU that you suggested, > > but all the work is done in userspace at application level. > > > > If you see any problem with this scheme or don't understand it > > please let me know and I will try to explain better. > > > > Hmm I guess "crash safety" is not well defined. > You and I were talking about "system crash" and indeed, this was > my only concern with kernel implementation of overlayfs watch. 
> > But with userspace HSM service, how can we guarantee that > modifications did not happen while the service is down? > > I don't really have a good answer for this. Very good point! > Thinking out loud, we would somehow need to make the default > permission deny for all modifications, maybe through some mount > property (e.g. MOUNT_ATTR_PROT_READ), causing the pre-write > hooks to default to EROFS if there is no "vfs filter" mount mark. > > Then it will be possible to expose a "safe" mount to users, where > modifications can never go unnoticed even when HSM service > crashes. Yeah, something like this. Although the bootstrap of this during mount may be a bit challenging. But maybe not. Also I'm thinking about other usecases - for HSM I agree we essentially need to take the FS down if the userspace counterpart is not working. What about other persistent change log usecases? Do we mandate that there is only one "persistent change log" daemon in the system (or per filesystem?) and that must be running or we take the filesystem down? And anybody who wants reliable notifications needs to consume service of this daemon? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-11-23 10:49 ` Jan Kara @ 2022-11-23 13:07 ` Amir Goldstein 0 siblings, 0 replies; 43+ messages in thread From: Amir Goldstein @ 2022-11-23 13:07 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner > > Hmm I guess "crash safety" is not well defined. > > You and I were talking about "system crash" and indeed, this was > > my only concern with kernel implementation of overlayfs watch. > > > > But with userspace HSM service, how can we guarantee that > > modifications did not happen while the service is down? > > > > I don't really have a good answer for this. > > Very good point! > > > Thinking out loud, we would somehow need to make the default > > permission deny for all modifications, maybe through some mount > > property (e.g. MOUNT_ATTR_PROT_READ), causing the pre-write > > hooks to default to EROFS if there is no "vfs filter" mount mark. > > > > Then it will be possible to expose a "safe" mount to users, where > > modifications can never go unnoticed even when HSM service > > crashes. > > Yeah, something like this. Although the bootstrap of this during mount may > be a bit challenging. But maybe not. > I don't think so. As I wrote on several occasions, some of the current HSMs are implemented as FUSE filesystems and require mount. As I imagine an HSM system (and as our in-house system works) there is a filesystem containing populated and unpopulated files that admin can access without any filters and there is a mount that is exposed to users where the filtering and on-demand populate happens. I am less worried about bringup. My HttpDirFS POC already does mount move of a marked mount on startup. My concern was how to handle dying fanotify group safely. > Also I'm thinking about other usecases - for HSM I agree we essentially > need to take the FS down if the userspace counterpart is not working. What > about other persistent change log usecases? 
Do we mandate that there is > only one "persistent change log" daemon in the system (or per filesystem?) > and that must be running or we take the filesystem down? And anybody who > wants reliable notifications needs to consume service of this daemon? Yes, I envision a single systemd-fsmonitor daemon (or instance per sb) that can handle subscribing to changes on subtree and can deal with the permission of dispatching events on subtrees. To answer your question, I think the bare minimum that we need to provide is a property of the mount (probably an event mask) that requires at least one active fanotify vfs filter to allow certain permission events to go through. I think it would make sense to allow a single FAN_CLASS_VFS_FILTER group mark per sb and one per mount. If use cases that require more vfs filters per sb/mount arise, we can revisit that restriction later. Thanks, Amir. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-11-16 16:24 ` Amir Goldstein 2022-11-17 12:38 ` Amir Goldstein @ 2022-11-21 16:40 ` Amir Goldstein 2022-11-23 12:11 ` Jan Kara 2022-11-23 10:10 ` Jan Kara 2 siblings, 1 reply; 43+ messages in thread From: Amir Goldstein @ 2022-11-21 16:40 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner On Wed, Nov 16, 2022 at 6:24 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > Can't we introduce some SRCU lock / unlock into > > > > file_start_write() / file_end_write() and then invoke synchronize_srcu() > > > > during checkpoint after removing ignore marks? It will be much cheaper as > > > > we don't have to flush all dirty data to disk and also writes can keep > > > > flowing while we wait for outstanding writes straddling checkpoint to > > > > complete. What do you think? > > > > > > Maybe, but this is not enough. > > > Note that my patches [1] are overlapping fsnotify_mark_srcu with > > > file_start_write(), so we would need to overlay fsnotify_mark_srcu > > > with this new "modify SRCU". > > > > > > [1] https://github.com/amir73il/linux/commits/fanotify_pre_content > > > > Yes, I know that and frankly, that is what I find somewhat ugly :) I'd rather > > have the "modify SRCU" cover the whole region we need - which means > > including the generation of PRE_MODIFY event. > > > > Yeh, it would be great if we can pull this off. OK. I decided to give this a shot, see: https://github.com/amir73il/linux/commits/sb_write_barrier It is just a sketch to show the idea, very lightly tested. What I did is, instead of converting all the sb_start,end_write() call sites, which would be a huge change, only callers of sb_start,end_write_srcu() participate in the "modify SRCU". I then converted all the dir modify call sites and some other call sites to use helpers that take SRCU and call pre-modify hooks. [...] 
> > > > The technical problem I see is how to deal with AIO / io_uring because > > > > SRCU needs to be released in the same context as it is acquired - that > > > > would need to be consulted with Paul McKenney if we can make it work. And > > > > another problem I see is that it might not be great to have this > > > > system-wide as e.g. on networking filesystems or pipes writes can block for > > > > really long. > > > > > > > I averted this problem for now - file data writes are not covered by s_write_srcu with my POC branch. The rationale is that with file data write, HSM would anyway need to use fsfreeze() to get any guarantee, so maybe s_write_srcu is not really useful here?? It might be useful to use s_write_srcu to cover the pre-modify event up to after sb_start_write() in file write/aio/io_uring, but not beyond it, so that sb_write_barrier()+fsfreeze() will provide full coverage for in-progress writes. Please let me know if this plan sounds reasonable. > > > > > Final question is how to expose this to userspace because this > > > > > functionality would seem useful outside of filesystem notification space so > > > > > probably do not need to tie it to that. In the current POC branch, nothing calls sb_write_barrier() yet. I was thinking of using it when flushing marks, maybe with the FAN_MARK_SYNC flag that I proposed. For a general-purpose API, the semantics would need to be better defined, as with this "opt-in" implementation, only some of the modification operations are covered by the 'sb write barrier'. Thanks, Amir. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-11-21 16:40 ` Amir Goldstein @ 2022-11-23 12:11 ` Jan Kara 2022-11-23 13:30 ` Amir Goldstein 0 siblings, 1 reply; 43+ messages in thread From: Jan Kara @ 2022-11-23 12:11 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Dave Chinner On Mon 21-11-22 18:40:06, Amir Goldstein wrote: > On Wed, Nov 16, 2022 at 6:24 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > > Can't we introduce some SRCU lock / unlock into > > > > > file_start_write() / file_end_write() and then invoke synchronize_srcu() > > > > > during checkpoint after removing ignore marks? It will be much cheaper as > > > > > we don't have to flush all dirty data to disk and also writes can keep > > > > > flowing while we wait for outstanding writes straddling checkpoint to > > > > > complete. What do you think? > > > > > > > > Maybe, but this is not enough. > > > > Note that my patches [1] are overlapping fsnotify_mark_srcu with > > > > file_start_write(), so we would need to overlay fsnotify_mark_srcu > > > > with this new "modify SRCU". > > > > > > > > [1] https://github.com/amir73il/linux/commits/fanotify_pre_content > > > > > > Yes, I know that and frankly, that is what I find somewhat ugly :) I'd rather > > > have the "modify SRCU" cover the whole region we need - which means > > > including the generation of PRE_MODIFY event. > > > > > > > Yeh, it would be great if we can pull this off. > > OK. I decided to give this a shot, see: > > https://github.com/amir73il/linux/commits/sb_write_barrier > > It is just a sketch to show the idea, very lightly tested. > What I did is, instead of converting all the sb_start,end_write() > call sites, which would be a huge change, only callers of > sb_start,end_write_srcu() participate in the "modify SRCU". > > I then converted all the dir modify call sites and some other > call sites to use helpers that take SRCU and call pre-modify hooks. > > [...] 
I've glanced over the changes and yes, that's what I was imagining :). > > > > > The technical problem I see is how to deal with AIO / io_uring because > > > > > SRCU needs to be released in the same context as it is acquired - that > > > > > would need to be consulted with Paul McKenney if we can make it work. And > > > > > another problem I see is that it might not be great to have this > > > > > system-wide as e.g. on networking filesystems or pipes writes can block for > > > > > really long. > > > > > > > I averted this problem for now - file data writes are not covered by > s_write_srcu with my POC branch. Since you've made the SRCU per sb there is no problem with writes blocking too long on some filesystems. I've asked RCU guys about the problem with SRCU being acquired / released from different contexts. Logically, it seems it should be possible to make this work but maybe I miss some technical detail. > The rationale is that with file data write, HSM would anyway need to > use fsfreeze() to get any guarantee, so maybe s_write_srcu is not really > useful here?? > > It might be useful to use s_write_srcu to cover the pre-modify event > up to after sb_start_write() in file write/aio/io_uring, but not byond it, > so that sb_write_barrier()+fsfreeze() will provide full coverage for > in-progress writes. > > Please let me know if this plan sounds reasonable. Let's see what RCU guys reply. I'd prefer to cover the whole write for simplicity if it is reasonably possible. > > > > > Final question is how to expose this to userspace because this > > > > > functionality would seem useful outside of filesystem notification space so > > > > > probably do not need to tie it to that. > > In the current POC branch, nothing calls sb_write_barrier() yet. > I was thinking of using it when flushing marks, maybe with the > FAN_MARK_SYNC flag that I proposed. Yes, that's what I'd imagine as well. 
> For general purpose API, the semantics would need to be better > defined, as with this "opt-in" implementation, only some of the > modification operations are covered by the 'sb write barrier'. Yes, for now we can keep things internal to fsnotify. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-11-23 12:11 ` Jan Kara @ 2022-11-23 13:30 ` Amir Goldstein 0 siblings, 0 replies; 43+ messages in thread From: Amir Goldstein @ 2022-11-23 13:30 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner On Wed, Nov 23, 2022 at 2:11 PM Jan Kara <jack@suse.cz> wrote: > > On Mon 21-11-22 18:40:06, Amir Goldstein wrote: > > On Wed, Nov 16, 2022 at 6:24 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > > > > Can't we introduce some SRCU lock / unlock into > > > > > > file_start_write() / file_end_write() and then invoke synchronize_srcu() > > > > > > during checkpoint after removing ignore marks? It will be much cheaper as > > > > > > we don't have to flush all dirty data to disk and also writes can keep > > > > > > flowing while we wait for outstanding writes straddling checkpoint to > > > > > > complete. What do you think? > > > > > > > > > > Maybe, but this is not enough. > > > > > Note that my patches [1] are overlapping fsnotify_mark_srcu with > > > > > file_start_write(), so we would need to overlay fsnotify_mark_srcu > > > > > with this new "modify SRCU". > > > > > > > > > > [1] https://github.com/amir73il/linux/commits/fanotify_pre_content > > > > > > > > Yes, I know that and frankly, that is what I find somewhat ugly :) I'd rather > > > > have the "modify SRCU" cover the whole region we need - which means > > > > including the generation of PRE_MODIFY event. > > > > > > > > > > Yeh, it would be great if we can pull this off. > > > > OK. I decided to give this a shot, see: > > > > https://github.com/amir73il/linux/commits/sb_write_barrier > > > > It is just a sketch to show the idea, very lightly tested. > > What I did is, instead of converting all the sb_start,end_write() > > call sites, which would be a huge change, only callers of > > sb_start,end_write_srcu() participate in the "modify SRCU". 
> > > > I then converted all the dir modify call sites and some other > > call sites to use helpers that take SRCU and call pre-modify hooks. > > > > [...] > > I've glanced over the changes and yes, that's what I was imagining :). FYI, just pushed an update to that branch which includes *write_srcu wrappers for all the call sites of security_file_permission(file, MAY_WRITE), mostly file_start_write_area()/file_end_write_srcu(). For async write, write_srcu only covers the pre-modify event and the I/O submission. > > > > > > > The technical problem I see is how to deal with AIO / io_uring because > > > > > > SRCU needs to be released in the same context as it is acquired - that > > > > > > would need to be consulted with Paul McKenney if we can make it work. And > > > > > > another problem I see is that it might not be great to have this > > > > > > system-wide as e.g. on networking filesystems or pipes writes can block for > > > > > > really long. > > > > > > > > > > I averted this problem for now - file data writes are not covered by > > s_write_srcu with my POC branch. > > Since you've made the SRCU per sb there is no problem with writes blocking > > too long on some filesystems. I've asked RCU guys about the problem with > > SRCU being acquired / released from different contexts. Logically, it seems > > it should be possible to make this work but maybe I miss some technical > > detail. > > > > > The rationale is that with file data write, HSM would anyway need to > > > use fsfreeze() to get any guarantee, so maybe s_write_srcu is not really > > > useful here?? > > That was a wrong statement. For data writes HSM would need to do syncfs() after sb_write_barrier(), not fsfreeze. With the current POC branch, fsfreeze would only be needed to wait for the async write completions.
> > It might be useful to use s_write_srcu to cover the pre-modify event > > up to after sb_start_write() in file write/aio/io_uring, but not beyond it, > > so that sb_write_barrier()+fsfreeze() will provide full coverage for > > in-progress writes. > > > > Please let me know if this plan sounds reasonable. > > Let's see what RCU guys reply. I'd prefer to cover the whole write for > simplicity if it is reasonably possible. > OK. Other backup plans:
- Set a flag in pre-modify events from async write so HSM knows that the event is not "atomic" with the modification
- HSM may deny all events of this sort if it needs to record a change, or it may mark the snapshot "requires freeze"
- Reliably deliver post-write events only for the async write completions (also on error), so HSM can track only those event pairs
Thanks, Amir. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-11-16 16:24 ` Amir Goldstein 2022-11-17 12:38 ` Amir Goldstein 2022-11-21 16:40 ` Amir Goldstein @ 2022-11-23 10:10 ` Jan Kara 2022-11-23 15:16 ` Amir Goldstein 2 siblings, 1 reply; 43+ messages in thread From: Jan Kara @ 2022-11-23 10:10 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Dave Chinner On Wed 16-11-22 18:24:06, Amir Goldstein wrote: > > > Why then give up on the POST_WRITE events idea? > > > Don't you think it could work? > > > > So as we are discussing, the POST_WRITE event is not useful when we want to > > handle crash safety. And if we have some other mechanism (like SRCU) which > > is able to guarantee crash safety, then what is the benefit of POST_WRITE? > > I'm not against POST_WRITE, I just don't see much value in it if we have > > another mechanism to deal with events straddling checkpoint. > > > > Not sure I follow. > > I think that crash safety can be achieved also with PRE/POST_WRITE: > - PRE_WRITE records an intent to write in persistent snapshot T > and add to in-memory map of in-progress writes of period T > - When "checkpoint T" starts, new PRE_WRITES are recorded in both > T and T+1 persistent snapshots, but event is added only to > in-memory map of in-progress writes of period T+1 > - "checkpoint T" ends when all in-progress writes of T are completed So maybe I miss something but suppose the situation I mentioned a few emails earlier:

PRE_WRITE for F   -> F recorded as modified in T
modify F
POST_WRITE for F
PRE_WRITE for F   -> ignored because F is already marked as modified
                  -> checkpoint T requested, modified files reported, process modified files
modify F
--------- crash

Now unless filesystem freeze or SRCU is part of checkpoint, we will never be notified about the last modification to F. So I don't see how PRE + POST_WRITE alone can achieve crash safety... 
And if we use filesystem freeze or SRCU as part of checkpoint, then processing of POST_WRITE events does not give us anything new. E.g. synchronize_srcu() during checkpoint before handing out the list of modified files makes sure all modifications to files for which PRE_MODIFY events were generated (and thus are listed as modified in checkpoint T) are visible to userspace. So am I missing some case where POST_WRITE would be more useful than SRCU? Because at this point I'd rather implement SRCU than POST_WRITE. > The trick with alternating snapshots "handover" is this > (perhaps I never explained it and I need to elaborate on the wiki [1]): > > [1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#Modified_files_query > > The changed files query results need to include recorded changes in both > "finalizing" snapshot T and the new snapshot T+1 that was started in > the beginning of the query. > > Snapshot T MUST NOT be discarded until checkpoint/handover > is complete AND the query results that contain changes recorded > in T and T+1 snapshots have been consumed. > > When the consumer ACKs that the query results have been safely stored > or acted upon (I called this operation "bless" snapshot T+1) then and > only then can snapshot T be discarded. > > After snapshot T is discarded a new query will start snapshot T+2. > A changed files query result includes the id of the last blessed snapshot. > > I think this is more or less equivalent to the SRCU that you suggested, > but all the work is done in userspace at application level. > > If you see any problem with this scheme or don't understand it > please let me know and I will try to explain better. So until now I was imagining that query results will be returned like one big memcpy. I.e. a one-off event where the "persistent log daemon" hands over the whole contents of checkpoint T to the client. Whatever happens with the returned data is the business of the client, whatever happens with the checkpoint T records in the daemon is the daemon's business. The model you seem to speak about here is somewhat different - more like a readdir() kind of approach where the client asks for access to checkpoint T data, the daemon provides the data record by record (probably serving the data from its files on disk), and when the client is done and "closes" checkpoint T, the daemon's records about checkpoint T can be erased. Am I getting it right? This however seems somewhat orthogonal to the SRCU idea. SRCU essentially serves only one purpose - making sure that modifications to all files for which we have received a PRE_WRITE event are visible in the respective files. > > > > The technical problem I see is how to deal with AIO / io_uring because > > > > SRCU needs to be released in the same context as it is acquired - that > > > > would need to be consulted with Paul McKenney if we can make it work. And > > > > another problem I see is that it might not be great to have this > > > > system-wide as e.g. on networking filesystems or pipes writes can block for > > > > really long. > > > > > > > > Final question is how to expose this to userspace because this > > > > functionality would seem useful outside of filesystem notification space so > > > > probably do not need to tie it to that. > > > > > > > > Or we could simplify our life somewhat and acquire SRCU when generating > > > > PRE_WRITE and drop it when generating POST_WRITE. This would keep SRCU > > > > within fsnotify and would mitigate the problems coming from system-wide > > > > SRCU. OTOH it will create problems when PRE_WRITE gets generated and > > > > POST_WRITE would not for some reason. Just brainstorming here, I've not > > > > really decided what's better... Seems there are several non-trivial challenges to surmount with this "userspace modification SRCU" idea. 
> > For now, I will stay in my comfort zone and try to make the POC > with PRE/POST_WRITE work and write the proof of correctness. > > I will have no objection at all if you figure out how to solve those > issues and guide me to a path for implementing sb_write_srcu. > It will make the userspace implementation much simpler, getting rid > of the in-progress writes in-memory tracking. It seems you have progressed on this front yourself so let's continue there :). Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-11-23 10:10 ` Jan Kara @ 2022-11-23 15:16 ` Amir Goldstein 0 siblings, 0 replies; 43+ messages in thread From: Amir Goldstein @ 2022-11-23 15:16 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner On Wed, Nov 23, 2022 at 12:10 PM Jan Kara <jack@suse.cz> wrote: > > On Wed 16-11-22 18:24:06, Amir Goldstein wrote: > > > > Why then give up on the POST_WRITE events idea? > > > > Don't you think it could work? > > > > > > So as we are discussing, the POST_WRITE event is not useful when we want to > > > handle crash safety. And if we have some other mechanism (like SRCU) which > > > is able to guarantee crash safety, then what is the benefit of POST_WRITE? > > > I'm not against POST_WRITE, I just don't see much value in it if we have > > > another mechanism to deal with events straddling checkpoint. > > > > > > > Not sure I follow. > > > > I think that crash safety can be achieved also with PRE/POST_WRITE: > > - PRE_WRITE records an intent to write in persistent snapshot T > > and add to in-memory map of in-progress writes of period T > > - When "checkpoint T" starts, new PRE_WRITES are recorded in both > > T and T+1 persistent snapshots, but event is added only to > > in-memory map of in-progress writes of period T+1 > > - "checkpoint T" ends when all in-progress writes of T are completed > > So maybe I miss something but suppose the situation I was mentioning few > emails earlier: > > PRE_WRITE for F -> F recorded as modified in T > modify F > POST_WRITE for F > > PRE_WRITE for F -> ignored because F is already marked as > modified > > -> checkpoint T requested, modified files > reported, process modified files > modify F > --------- crash > > Now unless filesystem freeze or SRCU is part of checkpoint, we will never > notify about the last modification to F. So I don't see how PRE + > POST_WRITE alone can achieve crash safety... 
> > And if we use filesystem freeze or SRCU as part of checkpoint, then > processing of POST_WRITE events does not give us anything new. E.g. > synchronize_srcu() during checkpoing before handing out list of modified > files makes sure all modifications to files for which PRE_MODIFY events > were generated (and thus are listed as modified in checkpoint T) are > visible for userspace. > > So am I missing some case where POST_WRITE would be more useful than SRCU? > Because at this point I'd rather implement SRCU than POST_WRITE. > I tend to agree. Even if POST_WRITE can be done, SRCU will be far better. > > The trick with alternating snapshots "handover" is this > > (perhaps I never explained it and I need to elaborate on the wiki [1]): > > > > [1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#Modified_files_query > > > > The changed files query results need to include recorded changes in both > > "finalizing" snapshot T and the new snapshot T+1 that was started in > > the beginning of the query. > > > > Snapshot T MUST NOT be discarded until checkpoint/handover > > is complete AND the query results that contain changes recorded > > in T and T+1 snapshots have been consumed. > > > > When the consumer ACKs that the query results have been safely stored > > or acted upon (I called this operation "bless" snapshot T+1) then and > > only then can snapshot T be discarded. > > > > After snapshot T is discarded a new query will start snapshot T+2. > > A changed files query result includes the id of the last blessed snapshot. > > > > I think this is more or less equivalent to the SRCU that you suggested, > > but all the work is done in userspace at application level. > > > > If you see any problem with this scheme or don't understand it > > please let me know and I will try to explain better. > > So until now I was imagining that query results will be returned like a one > big memcpy. I.e. 
> one off event where the "persistent log daemon" hands over > the whole contents of checkpoint T to the client. Whatever happens with the > returned data is the bussiness of the client, whatever happens with the > checkpoint T records in the daemon is the daemon's bussiness. The model you > seem to speak about here is somewhat different - more like readdir() kind > of approach where client asks for access to checkpoint T data, daemon > provides the data record by record (probably serving the data from its > files on disk), and when the client is done and "closes" checkpoint T, > daemon's records about checkpoint T can be erased. Am I getting it right? > Yes, something like that. The query result (which is actually a recursive readdir) could be huge. So it cannot really be returned as a blob, it must be streamed to consumers. > This however seems somewhat orthogonal to the SRCU idea. SRCU essentially > serves the only purpose - make sure that modifications to all files for > which we have received PRE_WRITE event are visible in respective files. > Absolutely right. Sorry for the noise, but at least you've learned one more thing about my persistent change snapshots architecture ;-) Thanks, Amir. ^ permalink raw reply [flat|nested] 43+ messages in thread
[parent not found: <BY5PR07MB6529795F49FB4E923AFCB062A3449@BY5PR07MB6529.namprd07.prod.outlook.com>]
* Re: thoughts about fanotify and HSM [not found] ` <BY5PR07MB6529795F49FB4E923AFCB062A3449@BY5PR07MB6529.namprd07.prod.outlook.com> @ 2022-09-14 9:29 ` Jan Kara 0 siblings, 0 replies; 43+ messages in thread From: Jan Kara @ 2022-09-14 9:29 UTC (permalink / raw) To: Plaster, Robert; +Cc: Amir Goldstein, Jan Kara, linux-fsdevel, Miklos Szeredi Hello Robert! On Mon 12-09-22 21:10:24, Plaster, Robert wrote: > HI Jan – Our team has been using fanotify for HSM as a DMAPI replacement > for a while now. We came from StorageTek/Sun/Oracle HSM product line > development teams. We have been working on this for about 5 years and > just this month are supporting end-users. DMAPI was a huge stumbling > block for us. We figured out what minimum set of api calls were needed to > make it work. > > Our experience with fanotify has been fantastic. Not much overhead CPU > load but for boot volumes we do filter out events for swap and other > (there are so many) OS temp files that are really of no concern to HSM. > We can create as many files as the file system on NVMe can without any > back-pressure and the HSM process will go as fast as the target media > supports. I'm glad to hear fanotify is useful for you. > We have tested close to 600M files per HSM client and we keep adding > client files as time permits, we have no coded limits for the number of > HSM clients or max number of files in the repository. Also, the > repository for HSM clients is heterogenous so it allows us to push files > from one client type to another without any transcoding. I asked the guys > doing the actual fanotify part to comment but they said it would be a > couple days as they are heads down on a fix for a customer. > > Currently we have HSM and punch-hole running on xfs and tested it on zfs > (works but client isn’t finalized) and we have Lustre and SpectrumScale > on our to-do list. 
Basically any FS with extended attributes should work > for HSM and some (not all) will work with punch-hole capabilities. > > We have developed a HSM target for certain object stores (Ceph librados > and we have our own in-house object store) that support stream-IO and of > course any tape technology. We have a replication tool for making an S3 > target look like the source FS but its just replication, not HSM. Until > we get a S3 io-streaming we can’t use it for HSM. Our implementation only > works with our open-source catalog, archive platform. We tried to > announce this capability to the ceph community but we could never get > past their gatekeepers so only people we actually talk to know about it. > > Check out our site (kinda sucks and a little markety) but it’s a good > primer. In it are links to the code and manuals we have done. We have not > put out on github yet but will very soon. We are getting ready to post > some big updates to really simplify installation and configuration and > some bug fixes for some weird edge-cases. Thanks for info and the links! It is interesting to learn something about how users are actually using our code :). Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-09-11 18:12 thoughts about fanotify and HSM Amir Goldstein 2022-09-12 12:57 ` Jan Kara @ 2022-09-21 23:27 ` Dave Chinner 2022-09-22 4:35 ` Amir Goldstein 1 sibling, 1 reply; 43+ messages in thread From: Dave Chinner @ 2022-09-21 23:27 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel On Sun, Sep 11, 2022 at 09:12:06PM +0300, Amir Goldstein wrote: > Hi Jan, > > I wanted to consult with you about preliminary design thoughts > for implementing a hierarchical storage manager (HSM) > with fanotify. > > I have been in contact with some developers in the past > who were interested in using fanotify to implement HSM > (to replace old DMAPI implementation). > > Basically, FAN_OPEN_PERM + FAN_MARK_FILESYSTEM > should be enough to implement a basic HSM, but it is not > sufficient for implementing more advanced HSM features. Ah, I wondered where the people with that DMAPI application went all those years ago after I told them they should look into using fanotify to replace the dependency they had on the DMAPI patched XFS that SGI maintained for years for SLES kernels... > Some of the HSM feature that I would like are: > - blocking hook before access to file range and fill that range > - blocking hook before lookup of child and optionally create child Ok, so these are to replace the DMAPI hooks that provided a blocking userspace upcall to the HSM to allow it to fetch data from offline tiers that wasn't currently in the filesystem itself. i.e. the inode currently has a hole over that range of data, but before the read can proceed the HSM needs to retrieve the data from the remote storage and write it into the local filesystem. I think that you've missed a bunch of blocking notifications that are needed, though. e.g. truncate needs to block while the HSM records that the file ranges it is storing offline are now freed. 
fallocate() needs to block while it waits for the HSM to tell it the ranges of the file that actually contain data and so need to be taken into account. (e.g. ZERO_RANGE needs to wait for offline data to be invalidated, COLLAPSE_RANGE needs offline data to be recalled just like a read() operation, etc). IOWs, any operation that manipulates the extent map or the data in the file needs a blocking upcall to the HSM so that it can restore and invalidate the offline data across the range of the operation that is about to be performed.... > My thoughts on the UAPI were: > - Allow new combination of FAN_CLASS_PRE_CONTENT > and FAN_REPORT_FID/DFID_NAME > - This combination does not allow any of the existing events > in mask > - It Allows only new events such as FAN_PRE_ACCESS > FAN_PRE_MODIFY and FAN_PRE_LOOKUP > - FAN_PRE_ACCESS and FAN_PRE_MODIFY can have > optional file range info > - All the FAN_PRE_ events are called outside vfs locks and > specifically before sb_writers lock as in my fsnotify_pre_modify [1] > POC > > That last part is important because the HSM daemon will > need to make modifications to the accessed file/directory > before allowing the operation to proceed. Yes, and that was the biggest problem with DMAPI - the locking involved. DMAPI operations have to block without holding any locks that the IO path, truncate, fallocate, etc might need, but once they are unblocked they need to regain those locks to allow the operation to proceed. 
XFS used "invisible IO" as the mechanism for avoiding sending DMAPI events for operations that were initiated by the HSM to move data into and out of the filesystem. No doubt you've heard us talk about invisible IO in the past - O_NOCMTIME is what that invisible IO has eventually turned into in a modern Linux kernel. We still use that for invisible IO - xfs_fsr uses it for moving data around during online defragmentation. The entire purpose of invisible IO was to provide a path for HSMs and userspace bulk data movers (e.g. HSM aware backup tools like xfsdump) to do IO without generating unnecessary or recursive DMAPI events.... IOWs, if we want a DMAPI replacement, we will need to formalise a method of performing syscall based operations that will not trigger HSM notification events. The other thing that XFS had for DMAPI was persistent storage in the inode of the event mask that inode should report events for. See the di_dmevmask and di_dmstate fields defined in the on-disk inode format here: https://git.kernel.org/pub/scm/fs/xfs/xfs-documentation.git/tree/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc There's no detail for them, but the event mask indicated what DMAPI events the inode should issue notifications for, and the state field held information about DMAPI operations in progress. The event field is the important one here - if the event field was empty, access to the inode never generated DMAPI events. When the HSM moved data offline, the "READ" event mask bit was set by the HSM and that triggered DMAPI events for any operation that needed to read data or manipulate the extent map. When the data was brought entirely back online, the event masks could be cleared. However, DMAPI also supports dual state operation, where the data in the local filesystem is also duplicated in the offline storage (e.g. immediately after a recall operation). 
This state can persist until data or layout is changed in the local filesystem, and so there's a "WRITE" event mask as well that allows the filesystem to inform the HSM that data it may have in offline storage is being changed. The state field is there to tell the HSM that an operation was in progress when the system crashed. As part of recovery, the HSM needs to find all the inodes that had DM operations in progress and either complete them or revert them to bring everything back to a consistent state. The SGI HSMs used the bulkstat interfaces to scan the fs and find inodes that had a non-zero DM state field. This is one of the reasons that having bulkstat scale out to scanning millions of inodes a second ends up being important - coherency checking between the ondisk filesystem state and the userspace offline data tracking databases is a very important admin operation. The XFS dmapi event and state mask control APIs are now deprecated. The XFS_IOC_FSSETDM ioctl could read and write the values, and the XFS V1 bulkstat ioctl could read them. There were also flags for things like extent mapping ioctls (FIEMAP equivalent) that ensured looking at the extent map didn't trigger DMAPI events and data recall. I guess what I'm trying to say is that there's a lot more to an efficient implementation of a HSM event notification mechanism than just implementing a couple of blocking upcalls. If we want something that will replace even simple DMAPI-based HSM use cases, we really need to think through how to support all the operations that a recall operation might be needed for and hence have to block. And we really should think about how to efficiently filter out unnecessary events so that we don't drown the HSM in IO events it just doesn't need to know about.... > Another thing that might be good to do is provide an administrative > interface to iterate and abort pending fanotify permission/pre-content > events. That was generally something the DMAPI event consumer provided. 
> You must have noticed the overlap between my old persistent > change tracking journal and this design. The referenced branch > is from that old POC. > > I do believe that the use cases somewhat overlap and that the > same building blocks could be used to implement a persistent > change journal in userspace as you suggested back then. That's a very different use case and set of requirements to a HSM. A HSM tracks much, much larger amounts of data than a persistent change journal. We had [C]XFS-DMAPI based HSMs running SLES in production that tracked half a billion inodes and > 10PB of data 15 years ago. These days I'd expect "exabyte" to be the unit of storage that large HSMs are storing. > Thoughts? I think that if the goal is to support HSMs with fanotify, we first need to think about how we efficiently support all the functionality HSMs require rather than just focus on a blocking fanotify read operation. We don't need to implement everything, but at least having a plan for things like handling the event filtering requirements without the HSM having to walk the entire filesystem and inject per-inode event filters after every mount would be a real good idea.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-09-21 23:27 ` Dave Chinner @ 2022-09-22 4:35 ` Amir Goldstein 2022-09-23 7:57 ` Dave Chinner 0 siblings, 1 reply; 43+ messages in thread From: Amir Goldstein @ 2022-09-22 4:35 UTC (permalink / raw) To: Dave Chinner; +Cc: Jan Kara, linux-fsdevel, Plaster, Robert On Thu, Sep 22, 2022 at 2:27 AM Dave Chinner <david@fromorbit.com> wrote: > > On Sun, Sep 11, 2022 at 09:12:06PM +0300, Amir Goldstein wrote: > > Hi Jan, > > > > I wanted to consult with you about preliminary design thoughts > > for implementing a hierarchical storage manager (HSM) > > with fanotify. > > > > I have been in contact with some developers in the past > > who were interested in using fanotify to implement HSM > > (to replace old DMAPI implementation). > > > > Basically, FAN_OPEN_PERM + FAN_MARK_FILESYSTEM > > should be enough to implement a basic HSM, but it is not > > sufficient for implementing more advanced HSM features. > > Ah, I wondered where the people with that DMAPI application went all > those years ago after I told them they should look into using > fanotify to replace the dependency they had on the DMAPI patched > XFS that SGI maintained for years for SLES kernels... > Indeed. Robert has said that the fanotify HSM code will be uploaded to github in the near future: https://deepspacestorage.com/resources/#downloads > > Some of the HSM feature that I would like are: > > - blocking hook before access to file range and fill that range > > - blocking hook before lookup of child and optionally create child > > Ok, so these are to replace the DMAPI hooks that provided a blocking > userspace upcall to the HSM to allow it fetch data from offline > teirs that wasn't currently in the filesystem itself. i.e. the inode > currently has a hole over that range of data, but before the read > can proceed the HSM needs to retreive the data from the remote > storage and write it into the local filesystem. 
> > I think that you've missed a bunch of blocking notifications that > are needed, though. e.g. truncate needs to block while the HSM > records the file ranges it is storing offline are now freed. > fallocate() needs to block while it waits for the HSM to tell it the > ranges of the file that actually contain data and so should need to > be taken into account. (e.g. ZERO_RANGE needs to wait for offline > data to be invalidated, COLLAPSE_RANGE needs offline data to be > recalled just like a read() operation, etc). > > IOWs, any operation that manipulates the extent map or the data in > the file needs a blocking upcall to the HSM so that it can restore > and invalidate the offline data across the range of the operation > that is about to be performed.... The FAN_PRE_MODIFY event mentioned below is intended to take care of those cases. It will be called from security_file_permission(MAY_WRITE) => fsnotify_perm(), which covers all the cases that you mentioned. > > > My thoughts on the UAPI were: > > - Allow new combination of FAN_CLASS_PRE_CONTENT > > and FAN_REPORT_FID/DFID_NAME > > - This combination does not allow any of the existing events > > in mask > > - It Allows only new events such as FAN_PRE_ACCESS > > FAN_PRE_MODIFY and FAN_PRE_LOOKUP > > - FAN_PRE_ACCESS and FAN_PRE_MODIFY can have > > optional file range info > > - All the FAN_PRE_ events are called outside vfs locks and > > specifically before sb_writers lock as in my fsnotify_pre_modify [1] > > POC > > > > That last part is important because the HSM daemon will > > need to make modifications to the accessed file/directory > > before allowing the operation to proceed. > > Yes, and that was the biggest problem with DMAPI - the locking > involved. DMAPI operations have to block without holding any locks > that the IO path, truncate, fallocate, etc might need, but once they > are unblocked they need to regain those locks to allow the operation > to proceed. 
> This was by far the ugliest part of the DMAPI patches, > and ultimately, the reason why it was never merged. > Part of the problem is that the DMAPI hooks were inside fs code. My intention is for the fanotify blocking hooks to be called in the VFS, before taking any locks, as much as possible. I already have a POC project that added those pre-modify hooks for the change tracking journal thing. For most of the syscalls, security_file_permission(MAY_WRITE) is called before taking vfs locks (sb_writers in particular), except for dedupe and clone and that one is my fault - 031a072a0b8a ("vfs: call vfs_clone_file_range() under freeze protection"), so I'll need to deal with it. > > Naturally that opens the possibility for new userspace > > deadlocks. Nothing that is not already possible with permission > > event, but maybe deadlocks that are more inviting to trip over. > > > > I am not sure if we need to do anything about this, but we > > could make it easier to ignore events from the HSM daemon > > itself if we want to, to make the userspace implementation easier. > > XFS used "invisible IO" as the mechanism for avoiding sending DMAPI > events for operations that we initiated by the HSM to move data into > and out of the filesystem. > > No doubt you've heard us talk about invisible IO in the past - > O_NOCMTIME is what that invisible IO has eventually turned into in a > modern Linux kernel. We still use that for invisible IO - xfs_fsr > uses it for moving data around during online defragmentation. The > entire purpose of invisible IO was to provide a path for HSMs and > userspace bulk data movers (e.g. HSM aware backup tools like > xfsdump) to do IO without generating unnecessary or recursive DMAPI > events.... > > IOWs, if we want a DMAPI replacement, we will need to formalise a > method of performing syscall based operations that will not trigger > HSM notification events. > This concept already exists in fanotify. This is what FMODE_NONOTIFY is for. 
The event->fd handed in FAN_ACCESS_PERM event can be used to read the file without triggering a recursive event. This is how Anti-Malware products scan files. I have extended that concept in my POC patch to avoid recursive FAN_LOOKUP_PERM event when calling Xat() syscall with a dirfd with FMODE_NONOTIFY: https://github.com/amir73il/linux/commits/fanotify_pre_content > The other thing that XFS had for DMAPI was persistent storage in the > inode of the event mask that inode should report events for. See > the di_dmevmask and di_dmstate fields defined in the on-disk inode > format here: > > https://git.kernel.org/pub/scm/fs/xfs/xfs-documentation.git/tree/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc > > There's no detail for them, but the event mask indicated what DMAPI > events the inode should issue notifications for, and the state field > held information about DMAPI operations in progress. > > The event field is the important one here - if the event field was > empty, access to the inode never generated DMAPI events. When the > HSM moved data offline, the "READ" event mask bit was set by the HSM > and that triggered DMAPI events for any operation that needed to > read data or manipulate the extent map. When the data was brought > entirely back online, the event masks count be cleared. > HSM can and should manage this persistent bit, but it does not have to be done using a special ioctl. We already use xattr and/or fileattr flags in our HSM implementation. I've mentioned this in my reply to Jan. The way I intend to address this "persistent READ hook" on the fanotify side is by attaching an HSM specific BPF program to a fanotify filesystem mark. We could standardize a fileattr flag just the same as NODUMP was for the use of backup applications (e.g. NODATA), but it is not a prerequisite, just a standard way for HSM to set the persistent READ hook bit. 
An HSM product could be configured to reappropriate NODUMP for that matter, or check for no READ bits in st_mode or an xattr. > However, DMAPI also supports dual state operation, where the > data in the local filesystem is also duplicated in the offline > storage (e.g. immediately after a recall operation). This state can > persist until data or layout is changed in the local filesystem, > and so there's a "WRITE" event mask as well that allows the > filesystem to inform the HSM that data it may have in offline > storage is being changed. > > The state field is there to tell the HSM that an operation was in > progress when the system crashed. As part of recovery, the HSM needs > to find all the inodes that had DM operations in progress and either > complete them or revert them to bring everything back to a > consistent state. The SGI HSMs used the bulkstat interfaces to scan > the fs and find inodes that had a non-zero DM state field. This is > one of the reasons that having bulkstat scale out to scanning > millions of inodes a second ends up being important - coherency > checking between the ondisk filesystem state and the userspace > offline data tracking databases is a very important admin > operation... > Normally, the HSM will be listening on a filesystem mark for async FAN_MODIFY and FAN_CLOSE_WRITE events. To cover the case of a crash and missing fanotify events, we use the persistent change tracking journal. My current prototype is in the overlayfs driver using pre-modify fsnotify hooks, as we discussed back in LSFMM 2018: https://github.com/amir73il/overlayfs/wiki/Overlayfs-watch The idea is to export those pre-modify hooks via fanotify and move this implementation from the overlayfs driver to a userspace HSM daemon. Note that the name "Change Tracking Journal" may be confusing - my implementation does not store a time sequence of events like the NTFS Change Journal - it only stores a map of file handles of directories containing new/changed/deleted files.
Iterating this "Changed dirs map" is way faster than iterating bulkstat of all inodes and looking for the WRITE bit. The responsibility of maintaining per-file "dirty" state is on the HSM, and it can be done using the change tracking journal and an external database. Filesystem-provided features such as ctime and iversion can be used to optimize the management of "dirty" state, but they are not a prerequisite, and most of the time the change journal is sufficient to be able to scale up, because it can give you the answer to the question: "In which of the multi-million dirs do I need to look for changes to be synced to secondary storage?". > The XFS dmapi event and state mask control APIs are now deprecated. > The XFS_IOC_FSSETDM ioctl could read and write the values, and the > XFS V1 bulkstat ioctl could read them. There were also flags for > things like extent mapping ioctls (FIEMAP equivalent) that ensured > looking at the extent map didn't trigger DMAPI events and data > recall. > > I guess what I'm trying to say is that there's a lot more to an > efficient implementation of a HSM event notification mechanism than > just implementing a couple of blocking upcalls. If we want something > that will replace even simple DMAPI-based HSM use cases, we really > need to think through how to support all the operations that a > recall operation might be needed for and hence have to block. And we > really should think about how to efficiently filter out unnecessary > events so that we don't drown the HSM in IO events it just doesn't > need to know about.... > Thinking about efficient HSM implementation and testing prototypes is what I have been doing for the past 6 years at CTERA. My thoughts and design for fanotify HSM can be backed up with several successful prototypes that have been deployed in large-scale customer environments where both high performance and reliable backups are very hard requirements.
> > Another thing that might be good to do is provide an administrative > > interface to iterate and abort pending fanotify permission/pre-content > > events. > > That was generally something the DMAPI event consumer provided. > > > You must have noticed the overlap between my old persistent > > change tracking journal and this design. The referenced branch > > is from that old POC. > > > > I do believe that the use cases somewhat overlap and that the > > same building blocks could be used to implement a persistent > > change journal in userspace as you suggested back then. > > That's a very different use case and set of requirements to a HSM. > > A HSM tracks much, much larger amounts of data than a persistent > change journal. We had [C]XFS-DMAPI based HSMs running on SLES in > production that tracked half a billion inodes and > 10PB of data 15 > years ago. These days I'd expect "exabyte" to be the unit of > storage that large HSMs are storing. > > > Thoughts? > > I think that if the goal is to support HSMs with fanotify, we first > need to think about how we efficiently support all the functionality > HSMs require rather than just focus on a blocking fanotify read > operation. We don't need to implement everything, but at least > having a plan for things like handling the event filtering > requirements without the HSM having to walk the entire filesystem > and inject per-inode event filters after every mount would be a real > good idea.... > All of the above have been considered and I have mentioned the proposed solutions in the thread. I may post some partial RFC patches to hash out some implementation details along the way (e.g. for FAN_LOOKUP_PERM), but when I send out the final proposal it will include all the fanotify extensions required to implement a fully functional and performant HSM. Given the number and length of the Q&A in this thread, I am probably going to summarize this discussion in a wiki to send along with the proposal for the fanotify HSM API.
Thanks, Amir. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-09-22 4:35 ` Amir Goldstein @ 2022-09-23 7:57 ` Dave Chinner 2022-09-23 11:22 ` Amir Goldstein 0 siblings, 1 reply; 43+ messages in thread From: Dave Chinner @ 2022-09-23 7:57 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Plaster, Robert On Thu, Sep 22, 2022 at 07:35:39AM +0300, Amir Goldstein wrote: > On Thu, Sep 22, 2022 at 2:27 AM Dave Chinner <david@fromorbit.com> wrote: > > > > On Sun, Sep 11, 2022 at 09:12:06PM +0300, Amir Goldstein wrote: > > > Hi Jan, > > > > > > I wanted to consult with you about preliminary design thoughts > > > for implementing a hierarchical storage manager (HSM) > > > with fanotify. > > > > > > I have been in contact with some developers in the past > > > who were interested in using fanotify to implement HSM > > > (to replace old DMAPI implementation). > > > > > > Basically, FAN_OPEN_PERM + FAN_MARK_FILESYSTEM > > > should be enough to implement a basic HSM, but it is not > > > sufficient for implementing more advanced HSM features. > > > > Ah, I wondered where the people with that DMAPI application went all > > those years ago after I told them they should look into using > > fanotify to replace the dependency they had on the DMAPI patched > > XFS that SGI maintained for years for SLES kernels... > > > > Indeed. Robert has said that the fanotify HSM code would be uploaded > to github in the near future: > > https://deepspacestorage.com/resources/#downloads > > > > Some of the HSM features that I would like are: > > > - blocking hook before access to file range and fill that range > > > - blocking hook before lookup of child and optionally create child > > > > Ok, so these are to replace the DMAPI hooks that provided a blocking > > userspace upcall to the HSM to allow it to fetch data from offline > > tiers that wasn't currently in the filesystem itself. i.e.
the inode > > currently has a hole over that range of data, but before the read > > can proceed the HSM needs to retrieve the data from the remote > > storage and write it into the local filesystem. > > > > I think that you've missed a bunch of blocking notifications that > > are needed, though. e.g. truncate needs to block while the HSM > > records the file ranges it is storing offline are now freed. > > fallocate() needs to block while it waits for the HSM to tell it the > > ranges of the file that actually contain data and so should need to > > be taken into account. (e.g. ZERO_RANGE needs to wait for offline > > data to be invalidated, COLLAPSE_RANGE needs offline data to be > > recalled just like a read() operation, etc). > > > > IOWs, any operation that manipulates the extent map or the data in > > the file needs a blocking upcall to the HSM so that it can restore > > and invalidate the offline data across the range of the operation > > that is about to be performed.... > > The event FAN_PRE_MODIFY mentioned below is destined to > take care of those cases. > It is destined to be called from > security_file_permission(MAY_WRITE) => fsnotify_perm() > which covers all the cases that you mentioned. That doesn't provide IO ranges to userspace, so does that mean write(1 byte) might trigger the recall of an entire TB-sized file from offline storage? How do you support partial file migration with a scheme like this?
> > > My thoughts on the UAPI were: > > > - Allow new combination of FAN_CLASS_PRE_CONTENT > > > and FAN_REPORT_FID/DFID_NAME > > > - This combination does not allow any of the existing events > > > in mask > > > - It allows only new events such as FAN_PRE_ACCESS > > > FAN_PRE_MODIFY and FAN_PRE_LOOKUP > > > - FAN_PRE_ACCESS and FAN_PRE_MODIFY can have > > > optional file range info > > > - All the FAN_PRE_ events are called outside vfs locks and > > > specifically before sb_writers lock as in my fsnotify_pre_modify [1] > > > POC > > > > > > That last part is important because the HSM daemon will > > > need to make modifications to the accessed file/directory > > > before allowing the operation to proceed. > > > > Yes, and that was the biggest problem with DMAPI - the locking > > involved. DMAPI operations have to block without holding any locks > > that the IO path, truncate, fallocate, etc might need, but once they > > are unblocked they need to regain those locks to allow the operation > > to proceed. This was by far the ugliest part of the DMAPI patches, > > and ultimately, the reason why it was never merged. > > Part of the problem is that the DMAPI hooks were inside fs code. > My intention is for the fanotify blocking hooks to be in the vfs before taking any > locks, as much as possible. > > I already have a POC project that added those pre-modify hooks > for the change tracking journal thing. > > For most of the syscalls, security_file_permission(MAY_WRITE) is > called before taking vfs locks (sb_writers in particular), except for > dedupe and clone, and that one is my fault - 031a072a0b8a > ("vfs: call vfs_clone_file_range() under freeze protection"), so > I'll need to deal with it. This looks like a fairly large TOCTOU race condition. i.e. what's to prevent a "move offline" operation run by a userspace HSM agent racing with an application read that has done its FAN_PRE_MODIFY call but hasn't yet started/completed its physical read from the filesystem?
This sort of race can result in the data in the file being migrated and punched out by the HSM before the read gets the i_rwsem or page locks it needs to read the data into cache.... What am I missing here? How does this "no filesystem locks" notification scheme avoid these sorts of spurious user data corruption events from occurring? > > > Naturally that opens the possibility for new userspace > > > deadlocks. Nothing that is not already possible with permission > > > event, but maybe deadlocks that are more inviting to trip over. > > > > > > I am not sure if we need to do anything about this, but we > > > could make it easier to ignore events from the HSM daemon > > > itself if we want to, to make the userspace implementation easier. > > > > XFS used "invisible IO" as the mechanism for avoiding sending DMAPI > > events for operations that were initiated by the HSM to move data into > > and out of the filesystem. > > > > No doubt you've heard us talk about invisible IO in the past - > > O_NOCMTIME is what that invisible IO has eventually turned into in a > > modern Linux kernel. We still use that for invisible IO - xfs_fsr > > uses it for moving data around during online defragmentation. The > > entire purpose of invisible IO was to provide a path for HSMs and > > userspace bulk data movers (e.g. HSM aware backup tools like > > xfsdump) to do IO without generating unnecessary or recursive DMAPI > > events.... > > > > IOWs, if we want a DMAPI replacement, we will need to formalise a > > method of performing syscall based operations that will not trigger > > HSM notification events. > > This concept already exists in fanotify. > This is what FMODE_NONOTIFY is for. > The event->fd handed in the FAN_ACCESS_PERM event can be used to > read the file without triggering a recursive event. > This is how Anti-Malware products scan files. So all that you need is to enable invisible IO for the data migration operations, then?
BTW, where does the writable fd that is passed to the userspace event that the HSM can use to do IO to/from the inode come from? How does fanotify guarantee that the recipient has the necessary permissions to read/write to the file it represents? > I have extended that concept in my POC patch to avoid a recursive > FAN_LOOKUP_PERM event when calling an Xat() syscall with > a dirfd with FMODE_NONOTIFY: > > https://github.com/amir73il/linux/commits/fanotify_pre_content > > > The other thing that XFS had for DMAPI was persistent storage in the > > inode of the event mask that inode should report events for. See > > the di_dmevmask and di_dmstate fields defined in the on-disk inode > > format here: > > > > https://git.kernel.org/pub/scm/fs/xfs/xfs-documentation.git/tree/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc > > > > There's no detail for them, but the event mask indicated what DMAPI > > events the inode should issue notifications for, and the state field > > held information about DMAPI operations in progress. > > > > The event field is the important one here - if the event field was > > empty, access to the inode never generated DMAPI events. When the > > HSM moved data offline, the "READ" event mask bit was set by the HSM > > and that triggered DMAPI events for any operation that needed to > > read data or manipulate the extent map. When the data was brought > > entirely back online, the event masks could be cleared. > > > > The HSM can and should manage this persistent bit, but it does not > have to be done using a special ioctl. > We already use xattr and/or fileattr flags in our HSM implementation. > > I've mentioned this in my reply to Jan. > The way I intend to address this "persistent READ hook" on the > fanotify side is by attaching an HSM-specific BPF program to > a fanotify filesystem mark. How is that fanotify mark and BPF hook set? What triggers it? What makes it persistent? Why does it need a BPF program? What is the program actually intended to do?
You're making big assumptions that people actually know what you are trying to implement, also how and why. Perhaps you should write a design overview that connects all the dots and explains why things like custom BPF programs are necessary... > We could standardize a fileattr flag just the same as NODUMP > was for the use of backup applications (e.g. NODATA), but it > is not a prerequisite, just a standard way for the HSM to set the > persistent READ hook bit. So you do actually need the filesystem to store information about HSM events that need to be generated, but then all the events are generated at the VFS without holding filesystem locks? That seems like it could be racy, too. > An HSM product could be configured to reappropriate NODUMP > for that matter or check for no READ bits in st_mode or xattr. You know better than to try to redefine the meaning of a well known attribute that has been exposed to userspace for 25+ years and is widely used... > > However, DMAPI also supports dual state operation, where the > > data in the local filesystem is also duplicated in the offline > > storage (e.g. immediately after a recall operation). This state can > > persist until data or layout is changed in the local filesystem, > > and so there's a "WRITE" event mask as well that allows the > > filesystem to inform the HSM that data it may have in offline > > storage is being changed. > > > > The state field is there to tell the HSM that an operation was in > > progress when the system crashed. As part of recovery, the HSM needs > > to find all the inodes that had DM operations in progress and either > > complete them or revert them to bring everything back to a > > consistent state. The SGI HSMs used the bulkstat interfaces to scan > > the fs and find inodes that had a non-zero DM state field.
This is > > one of the reasons that having bulkstat scale out to scanning > > millions of inodes a second ends up being important - coherency > > checking between the ondisk filesystem state and the userspace > > offline data tracking databases is a very important admin > > operation... > > > > Normally, the HSM will be listening on a filesystem mark for async > > FAN_MODIFY and FAN_CLOSE_WRITE events. > > > > To cover the case of a crash and missing fanotify events, we use > > the persistent change tracking journal. > > My current prototype is in the overlayfs driver using pre-modify > > fsnotify hooks, as we discussed back in LSFMM 2018: > > > > https://github.com/amir73il/overlayfs/wiki/Overlayfs-watch Yup, which led to the conclusion that the only way it could work was "synchronous write of the change intent to the change journal before the change is made". > The idea is to export those pre-modify hooks via fanotify > and move this implementation from the overlayfs driver to > a userspace HSM daemon. > > Note that the name "Change Tracking Journal" may be confusing - > my implementation does not store a time sequence of events like > the NTFS Change Journal - it only stores a map of file handles of > directories containing new/changed/deleted files. > > Iterating this "Changed dirs map" is way faster than iterating > bulkstat of all inodes and looking for the WRITE bit. Maybe so, but you haven't ever provided data to indicate what sort of filesystem modification rate this change journal can sustain. Just from my understanding of what it needs to do (sync updates), I can't see how it could be implemented without affecting overall runtime file and directory modification rates adversely.
At which point I ponder what is worse - always taking the runtime overhead for the change journal operations to be propagated to the HSM to record persistently and then replaying that journal after a mount, or the HSM just needing to run a bulkstat scan in the background after mount to do a coherency check and resolution scan? > The responsibility of maintaining per-file "dirty" state is on the HSM > and it can be done using the change tracking journal and an > external database. Filesystem-provided features such as ctime > and iversion can be used to optimize the management of "dirty" > state, but they are not a prerequisite, and most of the time the > change journal is sufficient to be able to scale up, because it > can give you the answer to the question: > "In which of the multi-million dirs do I need to look for changes > to be synced to secondary storage?". Sure, it's not every file that gets recorded, but it still sounds like this could be a significant percentage of the filesystem contents that need scanning. And the HSM would still have to scan them by the directory structure operations, which typically costs about 3-4x as much IO, CPU and wall time overhead per scanned inode compared to bulkstat. Please note that I'm not saying that either mechanism is better, nor am I saying that bulkstat scanning is a perfect solution, just that I've got zero information about how this change journal performs and what impact it has on runtime operation. Bulkstat scans, OTOH, I understand intimately and especially in the context of HSM operations. i.e. I know that scanning a billion inodes only takes a few minutes of wall time with modern SSDs and CPUs... > > The XFS dmapi event and state mask control APIs are now deprecated. > > The XFS_IOC_FSSETDM ioctl could read and write the values, and the > > XFS V1 bulkstat ioctl could read them.
There were also flags for > > things like extent mapping ioctls (FIEMAP equivalent) that ensured > > looking at the extent map didn't trigger DMAPI events and data > > recall. > > > > I guess what I'm trying to say is that there's a lot more to an > > efficient implementation of a HSM event notification mechanism than > > just implementing a couple of blocking upcalls. If we want something > > that will replace even simple DMAPI-based HSM use cases, we really > > need to think through how to support all the operations that a > > recall operation might be needed for and hence have to block. And we > > really should think about how to efficiently filter out unnecessary > > events so that we don't drown the HSM in IO events it just doesn't > > need to know about.... > > Thinking about efficient HSM implementation and testing prototypes is > what I have been doing for the past 6 years at CTERA. Great! It sounds like you'll have no trouble documenting the design to teach us all how you've solved these problems so we understand what you are asking us to review... :) -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: thoughts about fanotify and HSM 2022-09-23 7:57 ` Dave Chinner @ 2022-09-23 11:22 ` Amir Goldstein 0 siblings, 0 replies; 43+ messages in thread From: Amir Goldstein @ 2022-09-23 11:22 UTC (permalink / raw) To: Dave Chinner; +Cc: Jan Kara, linux-fsdevel, Plaster, Robert On Fri, Sep 23, 2022 at 10:57 AM Dave Chinner <david@fromorbit.com> wrote: > > On Thu, Sep 22, 2022 at 07:35:39AM +0300, Amir Goldstein wrote: > > On Thu, Sep 22, 2022 at 2:27 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Sun, Sep 11, 2022 at 09:12:06PM +0300, Amir Goldstein wrote: > > > > Hi Jan, > > > > > > > > I wanted to consult with you about preliminary design thoughts > > > > for implementing a hierarchical storage manager (HSM) > > > > with fanotify. > > > > > > > > I have been in contact with some developers in the past > > > > who were interested in using fanotify to implement HSM > > > > (to replace old DMAPI implementation). > > > > > > > > Basically, FAN_OPEN_PERM + FAN_MARK_FILESYSTEM > > > > should be enough to implement a basic HSM, but it is not > > > > sufficient for implementing more advanced HSM features. > > > > > > Ah, I wondered where the people with that DMAPI application went all > > > those years ago after I told them they should look into using > > > fanotify to replace the dependency they had on the DMAPI patched > > > XFS that SGI maintained for years for SLES kernels... > > > > > > > Indeed.
Robert has said that the fanotify HSM code would be uploaded > > to github in the near future: > > > > https://deepspacestorage.com/resources/#downloads > > > > > > Some of the HSM features that I would like are: > > > > - blocking hook before access to file range and fill that range > > > > - blocking hook before lookup of child and optionally create child > > > > > > Ok, so these are to replace the DMAPI hooks that provided a blocking > > > userspace upcall to the HSM to allow it to fetch data from offline > > > tiers that wasn't currently in the filesystem itself. i.e. the inode > > > currently has a hole over that range of data, but before the read > > > can proceed the HSM needs to retrieve the data from the remote > > > storage and write it into the local filesystem. > > > > > > I think that you've missed a bunch of blocking notifications that > > > are needed, though. e.g. truncate needs to block while the HSM > > > records the file ranges it is storing offline are now freed. > > > fallocate() needs to block while it waits for the HSM to tell it the > > > ranges of the file that actually contain data and so should need to > > > be taken into account. (e.g. ZERO_RANGE needs to wait for offline > > > data to be invalidated, COLLAPSE_RANGE needs offline data to be > > > recalled just like a read() operation, etc). > > > > > > IOWs, any operation that manipulates the extent map or the data in > > > the file needs a blocking upcall to the HSM so that it can restore > > > and invalidate the offline data across the range of the operation > > > that is about to be performed.... > > > > The event FAN_PRE_MODIFY mentioned below is destined to > > take care of those cases. > > It is destined to be called from > > security_file_permission(MAY_WRITE) => fsnotify_perm() > > which covers all the cases that you mentioned. > > That doesn't provide IO ranges to userspace, so does that mean > write(1 byte) might trigger the recall of an entire TB-sized file > from offline storage?
> > How do you support partial file migration with a scheme like this? That would be handled by: - FAN_PRE_ACCESS and FAN_PRE_MODIFY can have optional file range info in the API extension proposal below > > > > > My thoughts on the UAPI were: > > > > - Allow new combination of FAN_CLASS_PRE_CONTENT > > > > and FAN_REPORT_FID/DFID_NAME > > > > - This combination does not allow any of the existing events > > > > in mask > > > > - It allows only new events such as FAN_PRE_ACCESS > > > > FAN_PRE_MODIFY and FAN_PRE_LOOKUP > > > > - FAN_PRE_ACCESS and FAN_PRE_MODIFY can have > > > > optional file range info > > > > - All the FAN_PRE_ events are called outside vfs locks and > > > > specifically before sb_writers lock as in my fsnotify_pre_modify [1] > > > > POC > > > > > > > > That last part is important because the HSM daemon will > > > > need to make modifications to the accessed file/directory > > > > before allowing the operation to proceed. > > > > > > Yes, and that was the biggest problem with DMAPI - the locking > > > involved. DMAPI operations have to block without holding any locks > > > that the IO path, truncate, fallocate, etc might need, but once they > > > are unblocked they need to regain those locks to allow the operation > > > to proceed. This was by far the ugliest part of the DMAPI patches, > > > and ultimately, the reason why it was never merged. > > > > Part of the problem is that the DMAPI hooks were inside fs code. > > My intention is for the fanotify blocking hooks to be in the vfs before taking any > > locks, as much as possible. > > > > I already have a POC project that added those pre-modify hooks > > for the change tracking journal thing. > > > > For most of the syscalls, security_file_permission(MAY_WRITE) is > > called before taking vfs locks (sb_writers in particular), except for > > dedupe and clone, and that one is my fault - 031a072a0b8a > > ("vfs: call vfs_clone_file_range() under freeze protection"), so > > I'll need to deal with it.
> > This looks like a fairly large TOCTOU race condition. i.e. what's to > prevent a "move offline" operation run by a userspace HSM agent > racing with an application read that has done its FAN_PRE_MODIFY > call but hasn't yet started/completed its physical read from the > filesystem? This sort of race can result in the data in the file > being migrated and punched out by the HSM before the read gets the > i_rwsem or page locks it needs to read the data into cache.... > > What am I missing here? How does this "no filesystem locks" > notification scheme avoid these sorts of spurious user data > corruption events from occurring? > See my answer to Jan on the same question. The HSM will take a write lease on the file before punching out its data. It is an existing API that I do not even need to extend. I do need to POC this method though... > > > > Naturally that opens the possibility for new userspace > > > > deadlocks. Nothing that is not already possible with permission > > > > event, but maybe deadlocks that are more inviting to trip over. > > > > > > > > I am not sure if we need to do anything about this, but we > > > > could make it easier to ignore events from the HSM daemon > > > > itself if we want to, to make the userspace implementation easier. > > > > > > XFS used "invisible IO" as the mechanism for avoiding sending DMAPI > > > events for operations that were initiated by the HSM to move data into > > > and out of the filesystem. > > > > > > No doubt you've heard us talk about invisible IO in the past - > > > O_NOCMTIME is what that invisible IO has eventually turned into in a > > > modern Linux kernel. We still use that for invisible IO - xfs_fsr > > > uses it for moving data around during online defragmentation. The > > > entire purpose of invisible IO was to provide a path for HSMs and > > > userspace bulk data movers (e.g. HSM aware backup tools like > > > xfsdump) to do IO without generating unnecessary or recursive DMAPI > > > events....
> > > > > > IOWs, if we want a DMAPI replacement, we will need to formalise a > > > method of performing syscall based operations that will not trigger > > > HSM notification events. > > > > This concept already exists in fanotify. > > This is what FMODE_NONOTIFY is for. > > The event->fd handed in the FAN_ACCESS_PERM event can be used to > > read the file without triggering a recursive event. > > This is how Anti-Malware products scan files. > > So all that you need is to enable invisible IO for the data > migration operations, then? Correct. > > BTW, where does the writable fd that is passed to the userspace > event that the HSM can use to do IO to/from the inode come from? How > does fanotify guarantee that the recipient has the necessary > permissions to read/write to the file it represents? > fanotify opens the file and installs the fd in the event reader process. That has always been like that. That's one of the reasons that fanotify requires CAP_SYS_ADMIN for subscribing for permission events, so IOW, only CAP_SYS_ADMIN can create an fd for fanotify invisible IO and fanotify invisible lookup.
> > > > > > The event field is the important one here - if the event field was > > > empty, access to the inode never generated DMAPI events. When the > > > HSM moved data offline, the "READ" event mask bit was set by the HSM > > > and that triggered DMAPI events for any operation that needed to > > > read data or manipulate the extent map. When the data was brought > > > entirely back online, the event masks count be cleared. > > > > > > > HSM can and should manage this persistent bit, but it does not > > have to be done using a special ioctl. > > We already use xattr and/or fileattr flags in our HSM implementation. > > > > I've mentioned this in my reply to Jan. > > The way I intend to address this "persistent READ hook" in > > fanotify side is by attaching an HSM specific BFP program to > > an fanotify filesystem mark. > > How is that fanotify mark and BPF hook set? What triggers it? > What makes it persistent? Why does it need a BPF program? What is > the program actually intended to do? > Currently, fanotify is able to place a (non persistent) "mark" on an inode or a mount or a filesystem to subscribe for permission events. The BFP program is intended to be attached to a filesystem mark to filter in the kernel the events that the user gets. It is something we discussed and need in fanotify for other reasons as well (e.g. subtree filtering). When I am talking about a "persistent mark" on a file I do not mean that after reboot, access to a file will generate a permission event. What I mean is that after reboot, if the HSM is running and has placed a filesystem mark with a BFP filter on the filesystem, then it will get events on that file. So "persistent mark" can be just about anything that HSM chooses it to be, as long as a BFP program has a way to check that marker. > You're making big assumptions that people actually know what you are > trying to implement, also how and why. 
Perhaps you should write a > design overview that connects all the dots and explains why things > like custom BPF programs are necessary... Yes. I most definitely should. That will be my next step. And I will provide a demo HSM implementation with it. I just wanted to send out a heads-up to see if there are any immediate reasons to shoot down the idea. > > > We could standardize a fileattr flag just the same as NODUMP > > was for the use of backup applications (e.g. NODATA), but it > > is not a prerequisite, just a standard way for HSM to set the > > persistent READ hook bit. > > So you do actually need the filesystem to store information about > HSM events that need to be generated, but then all the events are > generated at the VFS without holding filesystem locks? That seems > like it could be racy, too. > The filesystem is just a host for the flag/xattr. Setting that marker and clearing it is completely under the responsibility of the HSM. "No filesystem locks" refers only to the period during which the user is handling the event. It does not mean that there are no locks and other synchronization involved when checking the mark. The HSM will only set or remove that mark with an exclusive write lease, so I do not think checking for the mark in a race-free manner is going to be a problem. We'll see... > > An HSM product could be configured to reappropriate NODUMP > > for that matter or check for no READ bits in st_mode or xattr. > > You know better than to try to redefine the meaning of a well known > attribute that has been exposed to userspace for 25+ years and is > widely used... > I do. That means that I will not propose to reappropriate NODUMP in kernel code, but an HSM implementation can do whatever it wants to do, as long as we let it control the flag and check it with a BPF program.
To be fair, even a backup application cannot know that it is the only backup application on the system, so it has always been the responsibility of the sysadmin (or the distro) to make sure that backup applications do not step on each other. I will definitely push for defining a new flag that is dedicated to HSM applications, but again, it would be the responsibility of the sysadmin (or the distro) to make sure that HSM applications do not step on each other. > > > However, DMAPI also supports dual state operation, where the > > > data in the local filesystem is also duplicated in the offline > > > storage (e.g. immediately after a recall operation). This state can > > > persist until data or layout is changed in the local filesystem, > > > and so there's a "WRITE" event mask as well that allows the > > > filesystem to inform the HSM that data it may have in offline > > > storage is being changed. > > > > > > The state field is there to tell the HSM that an operation was in > > > progress when the system crashed. As part of recovery, the HSM needs > > > to find all the inodes that had DM operations in progress and either > > > complete them or revert them to bring everything back to a > > > consistent state. The SGI HSMs used the bulkstat interfaces to scan > > > the fs and find inodes that had a non-zero DM state field. This is > > > one of the reasons that having bulkstat scale out to scanning > > > millions of inodes a second ends up being important - coherency > > > checking between the ondisk filesystem state and the userspace > > > offline data tracking databases is a very important admin > > > operation. > > > > > > > Normally, the HSM will be listening on a filesystem mark for async > > FAN_MODIFY and FAN_CLOSE_WRITE events. > > > > To cover the case of crash and missing fanotify events, we use > > the persistent change tracking journal.
> > My current prototype is in the overlayfs driver using pre-modify > > fsnotify hooks, as we discussed back in LSFMM 2018: > > > > https://github.com/amir73il/overlayfs/wiki/Overlayfs-watch > > Yup, which led to the conclusion that the only way it could work > was "synchronous write of the change intent to the change journal > before the change is made". > Yes, but this is not a performance killer, because the performance penalty is only paid on the first modification to a file in a directory. The fast path (directory is already marked dirty) does not need any writing to the change tracking journal at all. This is something that I am confident about, because the overlayfs watch code has been in production for a long time and we have lots of performance regression tests for large scale setups (i.e. many concurrent users, many files and dirs). > > The idea is to export those pre-modify hooks via fanotify > > and move this implementation from the overlayfs driver to > > a userspace HSM daemon. > > > > Note that the name "Change Tracking Journal" may be confusing - > > My implementation does not store a time sequence of events like > > the NTFS Change Journal - it only stores a map of file handles of > > directories containing new/changed/deleted files. > > > > Iterating this "Changed dirs map" is way faster than iterating > > bulkstat of all inodes and looking for the WRITE bit. > > Maybe so, but you haven't ever provided data to indicate what sort > of filesystem modification rate this change journal can sustain. > Just from my understanding of what it needs to do (sync updates), I > can't see how it could be implemented without affecting overall > runtime file and directory modification rates adversely. > This does seem like an impossible trade-off. The key to getting it right is making sure that most of the directory modifications are not "first modifications" and that depends on the frequency of the change tracking snapshots.
> At which point I ponder what is worse - always taking the runtime > overhead for the change journal operations to be propagated to the > HSM to record persistently and then replaying that journal after a > mount, or the HSM just needing to run a bulkstat scan in the > background after mount to do a coherency check and resolution scan? > My solution to that was "a little bit of both" and experience from the field shows that there is a way to strike a good balance. Even with weekly change tracking snapshots, the worst-case "scan at mount" times become much smaller and the once-per-week-per-dir performance penalty is in the noise. Essentially, the change tracking journal is most useful for users that have massive amounts of cold and rarely accessed files. Surely, a filesystem implementation of a "smart inode iterator" that includes user-defined filters (e.g. ctime > T) would have been a very efficient way to achieve the same thing, but I prefer a filesystem-agnostic solution. > > The responsibility of maintaining per file "dirty" state is on the HSM > > and it can be done using the change tracking journal and an > > external database. Filesystem provided features such as ctime > > and iversion can be used to optimize the management of "dirty" > > state, but they are not a prerequisite and most of the time the > > change journal is sufficient to be able to scale up, because it > > can give you the answer to the question: > > "In which of the multi million dirs, do I need to look for changes > > to be synced to secondary storage?". > > Sure, it's not every file that gets recorded, but it still sounds > like this could be a significant percentage of the filesystem > contents that need scanning. And the HSM would still have to scan > them by the directory structure operations, which typically costs > about 3-4x as much IO, CPU and wall time overhead per scanned inode > compared to bulkstat. > You are very much correct.
There is another advantage to directory structure traversal. It helps the HSM detect path changes, which is needed for some HSMs. If you take into account that you need to resolve the back path from the directory inode, then bulkstat loses big time to directory traversal. An important property of overlayfs watch that I did not get into, but which is crucial to the HSM scanning stage - the journal does not only record the changed dir, but all of its ancestors as well, so the much more important question that the journal can answer is: "Are there any changed files under this (huge) tree?". Modifications to files with hardlinks or disconnected paths are trickier, but it's workable, because those are the corner cases. > Please note that I'm not saying that either mechanism is better, nor > am I saying that bulkstat scanning is a perfect solution, just that > I've got zero information about how this change journal performs and > what impact it has on runtime operation. Bulkstat scans, OTOH, I > understand intimately and especially in the context of HSM > operations. i.e. I know that scanning a billion inodes only takes a > few minutes of wall time with modern SSDs and CPUs... > I have also considered using bulkstat many times. In the end it's a matter of balance. Some HSMs/data sets/workloads may be better off with bulkstat, but then again, I prefer that at least a filesystem-agnostic solution be available. > > > The XFS dmapi event and state mask control APIs are now deprecated. > > > The XFS_IOC_FSSETDM ioctl could read and write the values, and the > > > XFS V1 bulkstat ioctl could read them. There were also flags for > > > things like extent mapping ioctls (FIEMAP equivalent) that ensured > > > looking at the extent map didn't trigger DMAPI events and data > > > recall. > > > > > > I guess what I'm trying to say is that there's a lot more to an > > > efficient implementation of an HSM event notification mechanism than > > > just implementing a couple of blocking upcalls.
If we want something > > > that will replace even simple DMAPI-based HSM use cases, we really > > > need to think through how to support all the operations that a > > > recall operation might be needed for and hence have to block. And we > > > really should think about how to efficiently filter out unnecessary > > > events so that we don't drown the HSM in IO events it just doesn't > > > need to know about.... > > > > Thinking about efficient HSM implementation and testing prototypes is > > what I have been doing for the past 6 years at CTERA. > > Great! > > It sounds like you'll have no trouble documenting the design to > teach us all how you've solved these problems so we understand what > you are asking us to review... :) > I will try to do even better - I plan to provide reference design prototype code, but first a wiki! Thanks, Amir.
end of thread, other threads:[~2022-11-23 15:16 UTC | newest] Thread overview: 43+ messages 2022-09-11 18:12 thoughts about fanotify and HSM Amir Goldstein 2022-09-12 12:57 ` Jan Kara 2022-09-12 16:38 ` Amir Goldstein [not found] ` <BY5PR07MB652953061D3A2243F66F0798A3449@BY5PR07MB6529.namprd07.prod.outlook.com> 2022-09-13 2:41 ` Amir Goldstein 2022-09-14 7:27 ` Amir Goldstein 2022-09-14 10:30 ` Jan Kara 2022-09-14 11:52 ` Amir Goldstein 2022-09-20 18:19 ` Amir Goldstein 2022-09-22 10:48 ` Jan Kara 2022-09-22 13:03 ` Amir Goldstein 2022-09-26 15:27 ` Jan Kara 2022-09-28 12:29 ` Amir Goldstein 2022-09-29 10:01 ` Jan Kara 2022-10-07 13:58 ` Amir Goldstein 2022-10-12 15:44 ` Jan Kara 2022-10-12 16:28 ` Amir Goldstein 2022-10-13 12:16 ` Amir Goldstein 2022-11-03 12:57 ` Jan Kara 2022-11-03 13:38 ` Amir Goldstein 2022-10-28 12:50 ` Amir Goldstein 2022-11-03 16:30 ` Jan Kara 2022-11-04 8:17 ` Amir Goldstein 2022-11-07 11:10 ` Jan Kara 2022-11-07 14:13 ` Amir Goldstein 2022-11-14 19:17 ` Jan Kara 2022-11-14 20:08 ` Amir Goldstein 2022-11-15 10:16 ` Jan Kara 2022-11-15 13:08 ` Amir Goldstein 2022-11-16 10:56 ` Jan Kara 2022-11-16 16:24 ` Amir Goldstein 2022-11-17 12:38 ` Amir Goldstein 2022-11-23 10:49 ` Jan Kara 2022-11-23 13:07 ` Amir Goldstein 2022-11-21 16:40 ` Amir Goldstein 2022-11-23 12:11 ` Jan Kara 2022-11-23 13:30 ` Amir Goldstein 2022-11-23 10:10 ` Jan Kara 2022-11-23 15:16 ` Amir Goldstein [not found] ` <BY5PR07MB6529795F49FB4E923AFCB062A3449@BY5PR07MB6529.namprd07.prod.outlook.com> 2022-09-14 9:29 ` Jan Kara 2022-09-21 23:27 ` Dave Chinner 2022-09-22 4:35 ` Amir Goldstein 2022-09-23 7:57 ` Dave Chinner 2022-09-23 11:22 ` Amir Goldstein