From: Jan Kara <jack@suse.cz>
To: Amir Goldstein <amir73il@gmail.com>
Cc: Jan Kara <jack@suse.cz>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Dave Chinner <david@fromorbit.com>
Subject: Re: thoughts about fanotify and HSM
Date: Thu, 3 Nov 2022 17:30:45 +0100
Message-ID: <20221103163045.fzl6netcffk23sxw@quack3>
In-Reply-To: <CAOQ4uxiNhnV0OWU-2SY_N0aY19UdMboR3Uivcr7EvS7zdd9jxw@mail.gmail.com>

On Fri 28-10-22 15:50:04, Amir Goldstein wrote:
> On Thu, Sep 22, 2022 at 1:48 PM Jan Kara <jack@suse.cz> wrote:
> >
> > > Questions:
> > > - What do you think about the direction this POC has taken so far?
> > > - Is there anything specific that you would like to see in the POC
> > >   to be convinced that this API will be useful?
> >
> > I think your POC is taking a good direction and your discussion with Dave
> > had made me more confident that this is all workable :). I liked your idea
> > of the wiki (or whatever form of documentation) that summarizes what we've
> > discussed in this thread. That would be actually pretty nice for future
> > reference.
> >
> 
> The current state of POC is that "populate of access" of both files
> and directories is working and "race free evict of file content" is also
> implemented (safely AFAIK).
> 
> The technique involving exclusive write lease is discussed at [1].
> In a nutshell, populate and evict synchronize on atomic i_writecount
> and this technique can be implemented with upstream UAPIs.

Not so much i_writecount AFAIU but the generic lease mechanism overall. But
yes, the currently existing APIs should be enough for your purposes.
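
For reference, the existing lease UAPI is fcntl(2) F_SETLEASE. A minimal
sketch of an evict step guarded by an exclusive write lease, assuming Linux
and Python's fcntl module. The name try_evict() and the use of ftruncate() as
a stand-in for the real eviction are assumptions of this sketch, not what the
POC does:

```python
import fcntl
import os


def try_evict(path):
    """Evict file content only while holding an exclusive write lease.

    Returns True if the content was evicted, False if the lease could not
    be taken (for example because someone else has the file open).
    try_evict() is a hypothetical helper, not part of any existing tool.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        try:
            # A write lease is granted only if no other open file
            # descriptors refer to the file.
            fcntl.fcntl(fd, fcntl.F_SETLEASE, fcntl.F_WRLCK)
        except OSError:
            return False
        try:
            # Stand-in for the real eviction step (assumption of this
            # sketch); a real HSM would punch out the data instead.
            os.ftruncate(fd, 0)
            return True
        finally:
            # Drop the lease so that later opens are not delayed.
            fcntl.fcntl(fd, fcntl.F_SETLEASE, fcntl.F_UNLCK)
    finally:
        os.close(fd)
```

A real implementation would also have to handle the lease-break
notification (SIGIO by default) that arrives when another task opens the
file while the lease is held.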

> I did use persistent xattr marks for the POC, but this is not a must.
> Evictable inode marks would have worked just as well.

OK.

> Now I have started to work on persistent change tracking.
> For this, I have only kernel code, only lightly tested, but I did not
> prove yet that the technique is working.
> 
> The idea that I started to sketch at [2] is to alternate between two groups.
> 
> When a change is recorded, an evictable ignore mark will be added on the
> object.  To start recording changes from a new point in time
> (checkpoint), a new group will be created (with no ignore marks) and the
> old group will be closed.

So what I dislike about the scheme with handover between two groups is that
it is somewhat complex and, furthermore, requiring fs freezing for a checkpoint
is going to be rather expensive (and may be problematic if persistent
change tracking is used by potentially many unprivileged applications).

As a side note, I think it will be quite useful to be able to request a
checkpoint only for a subtree (e.g. some app may be interested only in a
particular subtree), and the scheme with two groups will make any
optimizations that benefit from this more difficult - either we create a
new group without ignore marks and then have to re-record changes nobody
actually needs, or we have to duplicate ignore marks, which is potentially
expensive as well.

Let's think about the race:

> To clarify, the race that I am trying to avoid is:
> 1. group B got a pre modify event and recorded the change before time T
> 2. The actual modification is performed after time T
> 3. group A does not get a pre modify event, so does not record the change
>     in the checkpoint since T

AFAIU you are worried about:

Task T				Change journal		App

write(file)
  generate pre_modify event
				record 'file' as modified
							Request changes
							Records 'file' contents
  modify 'file' data

...
							Request changes
							Nothing changed but
							App's view of 'file' is obsolete.
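
The interleaving above can be replayed in a small single-threaded model. This
is only an illustrative sketch: the Journal class, the files dict and
app_view are hypothetical names, and the journal records a change at
pre_modify time only, as in the diagram:

```python
class Journal:
    """Toy change journal that records names on pre_modify events only."""

    def __init__(self):
        self.period = 0
        self.changed = {0: set()}

    def pre_modify(self, name):
        # Record the change in the currently open period.
        self.changed[self.period].add(name)

    def checkpoint(self):
        """Close the current period and return the names recorded in it."""
        done = self.changed[self.period]
        self.period += 1
        self.changed[self.period] = set()
        return done


files = {"file": "old"}   # stand-in for on-disk contents
journal = Journal()
app_view = {}             # what the App believes the contents are

journal.pre_modify("file")            # Task T: write() generates pre_modify
for name in journal.checkpoint():     # App: request changes
    app_view[name] = files[name]      # App: records 'file' contents
files["file"] = "new"                 # Task T: modifies 'file' data
for name in journal.checkpoint():     # App: request changes, nothing reported
    app_view[name] = files[name]

# The App never learns about the new contents.
stale = app_view["file"] != files["file"]
```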

Can't we solve this by creating a POST_WRITE async event and then using it like:

1) Set state to CHECKPOINT_PENDING
2) In state CHECKPOINT_PENDING we record all received modify events into a
   separate 'transition' stream.
3) Remove ignore marks we need to remove.
4) Switch to new period & clear CHECKPOINT_PENDING, all events are now
   recorded to the new period.
5) Merge all events from 'transition' stream to both old and new period
   event streams.
6) Events get removed from the 'transition' stream only once we receive
   POST_WRITE event corresponding to the PRE_WRITE event recorded there (or
   on crash recovery). This way some events from 'transition' stream may
   get merged to multiple period event streams if the checkpoints are
   frequent and writes take long.

This should avoid the above race, should be relatively lightweight, and
does not require major API extensions.
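
A rough user-space model of steps 1-6, to make the bookkeeping concrete. This
is a sketch under assumptions, not an implementation: POST_WRITE is only a
proposed event, the ChangeJournal class and its event ids are hypothetical,
ignore marks (step 3) are not modelled, and keeping a write that finishes
while CHECKPOINT_PENDING is still set until after the merge is a choice made
for this sketch:

```python
import itertools


class ChangeJournal:
    def __init__(self):
        self.period = 0
        self.streams = {0: set()}   # period -> names recorded as modified
        self.pending = False        # True while in CHECKPOINT_PENDING
        self.transition = {}        # event id -> [name, write_finished]
        self._ids = itertools.count()

    def begin_checkpoint(self):
        self.pending = True                          # step 1

    def pre_write(self, name):
        event = next(self._ids)
        if self.pending:
            self.transition[event] = [name, False]   # step 2
        else:
            self.streams[self.period].add(name)
        return event

    def post_write(self, event):
        entry = self.transition.get(event)           # step 6
        if entry is None:
            return
        if self.pending:
            entry[1] = True   # not merged yet, drop it after the merge
        else:
            del self.transition[event]

    def finish_checkpoint(self):
        # Step 3 (removing ignore marks) is not modelled here.
        old = self.period
        self.period = old + 1                        # step 4
        self.streams[self.period] = set()
        self.pending = False
        for event, (name, finished) in list(self.transition.items()):
            self.streams[old].add(name)              # step 5
            self.streams[self.period].add(name)
            if finished:
                del self.transition[event]
        return self.streams[old]


files = {"file": "old"}
journal = ChangeJournal()
app_view = {}

journal.begin_checkpoint()
event = journal.pre_write("file")   # PRE_WRITE arrives while pending
for name in journal.finish_checkpoint():
    app_view[name] = files[name]    # App still sees the old contents
files["file"] = "new"               # the write actually happens
journal.post_write(event)           # POST_WRITE retires the event
journal.begin_checkpoint()
for name in journal.finish_checkpoint():
    app_view[name] = files[name]    # 'file' is reported again in period 1
```

Because the in-flight write was merged into both the old and the new period,
the second request reports 'file' again and the App picks up the new contents.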

BTW, while thinking about this I was wondering: How are the applications
using a persistent change journal going to deal with buffered vs direct IO? I
currently don't see a scheme that would not lose modifications for some
combinations...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR