Re: thoughts about fanotify and HSM

From: Amir Goldstein <amir73il@gmail.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Dave Chinner <david@fromorbit.com>
Subject: Re: thoughts about fanotify and HSM
Date: Wed, 23 Nov 2022 17:16:23 +0200
Message-ID: <CAOQ4uxg0hfuyQk3dBXfF2VTtfyOg5bD_NPrz0VOSFuVoA4vpuw@mail.gmail.com>
In-Reply-To: <20221123101021.7a65fgjop3d45ryq@quack3>

On Wed, Nov 23, 2022 at 12:10 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 16-11-22 18:24:06, Amir Goldstein wrote:
> > > > Why then give up on the POST_WRITE events idea?
> > > > Don't you think it could work?
> > >
> > > So as we are discussing, the POST_WRITE event is not useful when we want to
> > > handle crash safety. And if we have some other mechanism (like SRCU) which
> > > is able to guarantee crash safety, then what is the benefit of POST_WRITE?
> > > I'm not against POST_WRITE, I just don't see much value in it if we have
> > > another mechanism to deal with events straddling checkpoint.
> > >
> >
> > Not sure I follow.
> >
> > I think that crash safety can be achieved also with PRE/POST_WRITE:
> > - PRE_WRITE records an intent to write in persistent snapshot T
> >   and add to in-memory map of in-progress writes of period T
> > - When "checkpoint T" starts, new PRE_WRITES are recorded in both
> >   T and T+1 persistent snapshots, but event is added only to
> >   in-memory map of in-progress writes of period T+1
> > - "checkpoint T" ends when all in-progress writes of T are completed
>
> So maybe I am missing something, but consider the situation I mentioned a
> few emails earlier:
>
> PRE_WRITE for F                 -> F recorded as modified in T
> modify F
> POST_WRITE for F
>
> PRE_WRITE for F                 -> ignored because F is already marked as
>                                    modified
>
>                                 -> checkpoint T requested, modified files
>                                    reported, process modified files
> modify F
> --------- crash
>
> Now unless filesystem freeze or SRCU is part of checkpoint, we will never
> notify about the last modification to F. So I don't see how PRE +
> POST_WRITE alone can achieve crash safety...
>
> And if we use filesystem freeze or SRCU as part of checkpoint, then
> processing of POST_WRITE events does not give us anything new. E.g.
> synchronize_srcu() during checkpoint before handing out list of modified
> files makes sure all modifications to files for which PRE_MODIFY events
> were generated (and thus are listed as modified in checkpoint T) are
> visible for userspace.
>
> So am I missing some case where POST_WRITE would be more useful than SRCU?
> Because at this point I'd rather implement SRCU than POST_WRITE.
>

I tend to agree. Even if POST_WRITE can be done,
SRCU will be far better.
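
To make the ordering concrete, here is a toy userspace model of the
guarantee we want from the checkpoint. It is only a sketch: it is not the
kernel SRCU API, and the names (WriteTracker, begin_write, end_write,
checkpoint) are hypothetical. Per-period counters and a condition variable
stand in for the read-side critical section and for synchronize_srcu().

```python
import threading


class WriteTracker:
    """Toy model: every write is bracketed by begin_write()/end_write()."""

    def __init__(self):
        self._cond = threading.Condition()
        self._period = 0              # current period T
        self._inflight = {0: 0}       # period -> writes still in progress
        self._modified = {0: set()}   # period -> files recorded as modified

    def begin_write(self, path):
        # Plays the role of PRE_WRITE plus entering the read-side section.
        with self._cond:
            period = self._period
            self._modified[period].add(path)
            self._inflight[period] += 1
            return period

    def end_write(self, period):
        # Plays the role of leaving the read-side section.
        with self._cond:
            self._inflight[period] -= 1
            self._cond.notify_all()

    def checkpoint(self):
        # Start period T+1, then wait until every write that began in T
        # has finished before handing out the list for T.
        with self._cond:
            old = self._period
            self._period = old + 1
            self._inflight[self._period] = 0
            self._modified[self._period] = set()
            while self._inflight[old] > 0:
                self._cond.wait()
            del self._inflight[old]
            return self._modified.pop(old)
```

In your scenario, the second "modify F" sits between begin_write() and
end_write(), so checkpoint() cannot report F for period T until that
modification is complete.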

> > The trick with alternating snapshots "handover" is this
> > (perhaps I never explained it and I need to elaborate on the wiki [1]):
> >
> > [1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#Modified_files_query
> >
> > The changed files query results need to include recorded changes in both
> > "finalizing" snapshot T and the new snapshot T+1 that was started in
> > the beginning of the query.
> >
> > Snapshot T MUST NOT be discarded until checkpoint/handover
> > is complete AND the query results that contain changes recorded
> > in T and T+1 snapshots have been consumed.
> >
> > When the consumer ACKs that the query results have been safely stored
> > or acted upon (I called this operation "bless" snapshot T+1) then and
> > only then can snapshot T be discarded.
> >
> > After snapshot T is discarded a new query will start snapshot T+2.
> > A changed files query result includes the id of the last blessed snapshot.
> >
> > I think this is more or less equivalent to the SRCU that you suggested,
> > but all the work is done in userspace at application level.
> >
> > If you see any problem with this scheme or don't understand it
> > please let me know and I will try to explain better.
>
> So until now I was imagining that query results will be returned like a one
> big memcpy. I.e. one off event where the "persistent log daemon" hands over
> the whole contents of checkpoint T to the client. Whatever happens with the
> returned data is the business of the client, whatever happens with the
> checkpoint T records in the daemon is the daemon's business. The model you
> seem to speak about here is somewhat different - more like readdir() kind
> of approach where client asks for access to checkpoint T data, daemon
> provides the data record by record (probably serving the data from its
> files on disk), and when the client is done and "closes" checkpoint T,
> daemon's records about checkpoint T can be erased. Am I getting it right?
>

Yes, something like that.
The query result (which is actually a recursive readdir) could be huge.
So it cannot really be returned as a blob, it must be streamed to consumers.
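
A minimal sketch of the handover, assuming an in-memory log. The ChangeLog
class and its method names are hypothetical, made up for illustration. It
also leaves out persistence and the double recording of PRE_WRITE in T and
T+1 while the checkpoint is in progress.

```python
class ChangeLog:
    """Toy model of alternating change snapshots with a streamed query."""

    def __init__(self):
        self.snapshots = {0: set()}   # snapshot id -> recorded paths
        self.current = 0              # snapshot receiving new records
        self.finalizing = None        # snapshot T awaiting the consumer ACK
        self.blessed = None           # id of the last blessed snapshot

    def record(self, path):
        self.snapshots[self.current].add(path)

    def query(self):
        # Start snapshot T+1 unless an earlier query is still un-ACKed,
        # in which case the same T and T+1 are served again.
        if self.finalizing is None:
            self.finalizing = self.current
            self.current += 1
            self.snapshots[self.current] = set()
        finalizing, current, blessed = self.finalizing, self.current, self.blessed

        def stream():
            # Results are yielded record by record, never as one blob.
            yield ("blessed", blessed)
            seen = set()
            for sid in (finalizing, current):
                for path in sorted(self.snapshots[sid]):
                    if path not in seen:
                        seen.add(path)
                        yield ("changed", path)

        return stream()

    def bless(self):
        # Consumer ACK: only now may snapshot T be discarded.
        if self.finalizing is None:
            raise RuntimeError("no query to acknowledge")
        del self.snapshots[self.finalizing]
        self.blessed = self.current
        self.finalizing = None
```

If the consumer dies before calling bless(), the next query() serves T and
T+1 again, so nothing recorded in T is lost.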

> This however seems somewhat orthogonal to the SRCU idea. SRCU essentially
> serves a single purpose - to make sure that modifications to all files for
> which we have received a PRE_WRITE event are visible in the respective files.
>

Absolutely right.
Sorry for the noise, but at least you've learned one more thing
about my persistent change snapshots architecture ;-)

Thanks,
Amir.