* Re: fscache: Redesigning the on-disk cache
2021-03-08 9:13 ` David Howells
@ 2021-03-08 10:35 ` Amir Goldstein
2021-03-08 11:28 ` Metadata writtenback notification? -- was " David Howells
` (4 subsequent siblings)
5 siblings, 0 replies; 13+ messages in thread
From: Amir Goldstein @ 2021-03-08 10:35 UTC (permalink / raw)
To: David Howells
Cc: linux-cachefs, Jeff Layton, David Wysochanski,
Matthew Wilcox (Oracle),
J. Bruce Fields, Christoph Hellwig, Dave Chinner, Alexander Viro,
linux-afs, Linux NFS Mailing List, CIFS, ceph-devel,
v9fs-developer, linux-fsdevel, linux-kernel, Miklos Szeredi
On Mon, Mar 8, 2021 at 11:14 AM David Howells <dhowells@redhat.com> wrote:
>
> Amir Goldstein <amir73il@gmail.com> wrote:
>
> > > (0a) As (0) but using SEEK_DATA/SEEK_HOLE instead of bmap and opening the
> > > file for every whole operation (which may combine reads and writes).
> >
> > I read that NFSv4 supports hole punching, so when using ->bmap() or SEEK_DATA
> > to keep track of present data, it's hard to distinguish between an
> > invalid cached range and a valid "cached hole".
>
> I wasn't exactly intending to permit caching over NFS. That leads to fun
> making sure that the superblock you're caching isn't the one that has the
> cache in it.
>
> However, we will need to handle hole-punching being done on a cached netfs,
> even if that's just to completely invalidate the cache for that file.
>
> > With ->fiemap() you can at least make the distinction between a non existing
> > and an UNWRITTEN extent.
>
> I can't use that for XFS, Ext4 or btrfs, I suspect. Christoph and Dave's
> assertion is that the cache can't rely on the backing filesystem's metadata
> because these can arbitrarily insert or remove blocks of zeros to bridge or
> split extents.
>
> > You didn't say much about crash consistency or durability requirements of the
> > cache. Since cachefiles only syncs the cache on shutdown, I guess you
> > rely on the hosting filesystem to provide the required ordering guarantees.
>
> There's an xattr on each file in the cache to record the state. I use this
> to mark a cache file "open". If, when I look up a file, the file is marked open,
> it is just discarded at the moment.
>
> Now, there are two types of data stored in the cache: data that has to be
> stored as a single complete blob and is replaced as such (e.g. symlinks and
> AFS dirs) and data that might be randomly modified (e.g. regular files).
>
> For the former, I have code, though in yet another branch, that writes this in
> a tmpfile, sets the xattrs and then uses vfs_link(LINK_REPLACE) to cut over.
>
> For the latter, that's harder to do as it would require copying the data to
> the tmpfile before we're allowed to modify it. However, if it's possible to
> create a tmpfile that's a CoW version of a data file, I could go down that
> route.
>
> But after I've written and sync'd the data, I set the xattr to mark the file
> not open. At the moment I'm doing this too lazily, only doing it when a netfs
> file gets evicted or when the cache gets withdrawn, but I really need to add a
> queue of objects to be sealed as they're closed. The balance is working out
> how often to do the sealing as something like a shell script can do a lot of
> consecutive open/write/close ops.
>
You could add an internal vfs API wait_for_multiple_inodes_to_be_synced().
For example, xfs keeps the "LSN" on each inode, so once the transaction
with some LSN has been committed, all the relevant inodes, if not dirty, can
be declared as synced, without having to call fsync() on any file and without
having to force transaction commit or any IO at all.
Since fscache takes care of submitting the IO, and it shouldn't care about any
specific time that the data/metadata hits the disk(?), you can make use of the
existing periodic writeback and rolling transaction commit and only ever need
to wait for that to happen before marking cache files "closed".
There was a discussion about fsyncing a range of files on LSFMM [1].
In the last comment on the article dchinner argues why we already have that
API (and now also with io_uring()), but AFAIK, we do not have a useful
wait_for_sync() API. And it doesn't need to be exposed to userspace at all.
[1] https://lwn.net/Articles/789024/
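The proposed wait_for_multiple_inodes_to_be_synced() doesn't exist; the nearest existing primitive is Linux's syncfs(), which flushes every dirty inode on one filesystem in a single call. A rough userspace sketch of the batching idea, under the assumption that all cache files live on the same backing filesystem (the function name is invented):

```c
/* Sketch: commit a batch of cache files with one syncfs() call instead of
 * one fsync() per file.  syncfs() flushes all dirty inodes on the
 * filesystem backing `fd`, so after it returns every file in the batch is
 * durable and can have its "open" mark cleared.  Unlike the proposed API
 * this forces the flush rather than waiting for periodic writeback, but it
 * amortises the cost over the whole batch the same way. */
#include <unistd.h>

/* syncfs() is Linux-specific and only declared under _GNU_SOURCE;
 * declare it directly so the sketch compiles without feature macros. */
extern int syncfs(int fd);

int commit_cache_batch(const int *fds, int n)
{
	if (n <= 0)
		return 0;
	/* One call covers every file: they all sit on the cache fs. */
	if (syncfs(fds[0]) < 0)
		return -1;
	/* All n files are now stable; the caller may mark them "closed". */
	return 0;
}
```

The same amortisation is what the in-kernel LSN wait would buy, minus the forced I/O.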
> > Anyway, how are those ordering requirements going to be handled when the
> > entire index is in a file? You'd practically need to re-implement a filesystem
>
> Yes, the thought has occurred to me too. I would be implementing a "simple"
> filesystem - and we have lots of those :-/. The most obvious solution is to
> use the backing filesystem's metadata - except that that's not possible.
>
> > journal or only write cache updates to a temp file that can be discarded at
> > any time?
>
> It might involve keeping a bitmap of "open" blocks. Those blocks get
> invalidated when the cache restarts. The simplest solution would be to wipe
> the entire cache in such a situation, but that goes against one of the
> important features I want out of it.
>
> Actually, a journal of open and closed blocks might be better, though all I
> really need to store for each block is a 32-bit number.
>
> It's a particular problem if I'm doing DIO to the data storage area but
> buffering the changes to the metadata. Further, the metadata and data might
> be on different media, just to add to the complexity.
>
> Another possibility is only to cull blocks when the parent file is culled.
> That probably makes more sense as, as long as the file is registered as culled on
> disk first and I don't reuse the file slot too quickly, I can write to the
> data store before updating the metadata.
>
If I were you, I would try to avoid re-implementing a journaled filesystem or
a database for fscache and try to make use of crash consistency guarantees
that filesystems already provide.
Namely, use the data dependency already provided by temp files.
It doesn't need to be one temp file per cached file.
Always easier said than done ;-)
Thanks,
Amir.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Metadata writtenback notification? -- was Re: fscache: Redesigning the on-disk cache
2021-03-08 9:13 ` David Howells
2021-03-08 10:35 ` Amir Goldstein
@ 2021-03-08 11:28 ` David Howells
2021-03-08 22:32 ` Dave Chinner
2021-03-09 11:27 ` David Howells
2021-03-08 18:54 ` J. Bruce Fields
` (3 subsequent siblings)
5 siblings, 2 replies; 13+ messages in thread
From: David Howells @ 2021-03-08 11:28 UTC (permalink / raw)
To: Amir Goldstein
Cc: dhowells, linux-cachefs, Jeff Layton, David Wysochanski,
Matthew Wilcox (Oracle),
J. Bruce Fields, Christoph Hellwig, Dave Chinner, Alexander Viro,
linux-afs, Linux NFS Mailing List, CIFS, ceph-devel,
v9fs-developer, linux-fsdevel, linux-kernel, Miklos Szeredi
Amir Goldstein <amir73il@gmail.com> wrote:
> > But after I've written and sync'd the data, I set the xattr to mark the
> > file not open. At the moment I'm doing this too lazily, only doing it
> > when a netfs file gets evicted or when the cache gets withdrawn, but I
> > really need to add a queue of objects to be sealed as they're closed. The
> > balance is working out how often to do the sealing as something like a
> > shell script can do a lot of consecutive open/write/close ops.
>
> You could add an internal vfs API wait_for_multiple_inodes_to_be_synced().
> For example, xfs keeps the "LSN" on each inode, so once the transaction
> with some LSN has been committed, all the relevant inodes, if not dirty, can
> be declared as synced, without having to call fsync() on any file and without
> having to force transaction commit or any IO at all.
>
> Since fscache takes care of submitting the IO, and it shouldn't care about any
> specific time that the data/metadata hits the disk(?), you can make use of the
> existing periodic writeback and rolling transaction commit and only ever need
> to wait for that to happen before marking cache files "closed".
>
> There was a discussion about fsyncing a range of files on LSFMM [1].
> In the last comment on the article dchinner argues why we already have that
> API (and now also with io_uring()), but AFAIK, we do not have a useful
> wait_for_sync() API. And it doesn't need to be exposed to userspace at all.
>
> [1] https://lwn.net/Articles/789024/
This sounds like an interesting idea. Actually, what I probably want is a
notification to say that a particular object has been completely sync'd to
disk, metadata and all.
I'm not sure that io_uring is particularly usable from within the kernel,
though.
> If I were you, I would try to avoid re-implementing a journaled filesystem or
> a database for fscache and try to make use of crash consistency guarantees
> that filesystems already provide.
> Namely, use the data dependency already provided by temp files.
> It doesn't need to be one temp file per cached file.
>
> Always easier said than done ;-)
Yes.
There are a number of considerations I have to deal with, and they're somewhat
at odds with each other:
(1) I need to record what data I have stored from a file.
(2) I need to record where I stored the data.
(3) I need to make sure that I don't see old data.
(4) I need to make sure that I don't see data in the wrong file.
(5) I need to make sure I lose as little as possible on a crash.
(6) I want to be able to record what changes were made in the event we're
disconnected from the server.
For my fscache-iter branch, (1) is done with a map in an xattr, but I only
cache up to 1G in a file at the moment; (2), (4) and, to some extent (5), are
handled by the backing fs; (3) is handled by tagging the file and storing
coherency data in an xattr (though tmpfiles are used on full invalidation).
(6) is not yet supported.
For upstream, (1), (2), (4) and to some extent (5) are handled through the
backing fs. (3) is handled by storing coherency data in an xattr and
truncating the file on invalidation; (6) is not yet supported.
However, some performance problems are arising in my fscache-iter
branch:
(1) It's doing a lot of synchronous metadata operations (tmpfile, truncate,
setxattr).
(2) It's retaining a lot of open file structs on cache files. Cachefiles
opens the file when it's first asked to access it and retains that till
the cookie is relinquished or the cache withdrawn (the file* doesn't
contribute to ENFILE/EMFILE but it still eats memory).
I can mitigate this by closing much sooner, perhaps opening the file for
each operation - but at the cost of having to spend time doing more opens
and closes. What's in upstream gets away without having to do open/close
for reads because it calls readpage.
Alternatively, I can have a background file closer - which requires an
LRU queue. This could be combined with a file "sealer".
Deferred writeback on the netfs starting writes to the cache makes this
more interesting as I have to retain the interest on the cache object
beyond the netfs file being closed.
(3) Trimming excess data from the end of the cache file. The problem with
using DIO to write to the cache is that the write has to be rounded up to
a multiple of the backing fs DIO blocksize, but if the file is truncated
larger, that excess data now becomes part of the file.
Possibly it's sufficient to just clear the excess page space before
writing, but that doesn't necessarily stop a writable mmap from
scribbling on it.
(4) Committing outstanding cache metadata at cache withdrawal or netfs
unmount. I've previously mentioned this: it ends up with a whole slew of
synchronous metadata changes being committed to the cache in one go
(truncates, fallocates, fsync, xattrs, unlink+link of tmpfile) - and this
can take quite a long time. The cache needs to be more proactive in
getting stuff committed as it goes along.
(5) Attaching to an object requires a pathwalk to it (normally only two
steps) and then reading various xattrs on it - all synchronous, but can
be punted to a background threadpool.
Amongst the reasons I was considering moving to an index and a single datafile
is to replace the path-lookup step for each object and the xattr reads to
looking in a single file and to reduce the number of open files in the cache
at any one time to around four.
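The "background file closer" in point (2) amounts to a bounded cache of open fds with LRU eviction; fs/nfsd/filecache.c is the real-world version. A toy sketch (all names invented, capacity shrunk for illustration):

```c
/* Toy sketch of an open-file cache with a built-in "closer": keep at most
 * CACHE_SLOTS file descriptors open, reopening on demand and closing the
 * least-recently-used entry when the table is full. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CACHE_SLOTS 2

struct slot {
	char		path[256];
	int		fd;		/* -1 = empty slot */
	unsigned long	last_use;	/* LRU clock value */
};

static struct slot slots[CACHE_SLOTS];
static unsigned long lru_clock;

void filecache_init(void)
{
	for (int i = 0; i < CACHE_SLOTS; i++)
		slots[i].fd = -1;
}

/* Return an open fd for path, evicting the LRU entry if necessary. */
int filecache_get(const char *path)
{
	struct slot *victim = &slots[0];

	for (int i = 0; i < CACHE_SLOTS; i++) {
		if (slots[i].fd >= 0 && strcmp(slots[i].path, path) == 0) {
			slots[i].last_use = ++lru_clock;
			return slots[i].fd;	/* cache hit */
		}
		if (slots[i].fd < 0)
			victim = &slots[i];	/* prefer an empty slot */
		else if (victim->fd >= 0 &&
			 slots[i].last_use < victim->last_use)
			victim = &slots[i];	/* track the LRU entry */
	}
	if (victim->fd >= 0)
		close(victim->fd);		/* the "closer" at work */
	victim->fd = open(path, O_RDWR | O_CREAT, 0600);
	snprintf(victim->path, sizeof(victim->path), "%s", path);
	victim->last_use = ++lru_clock;
	return victim->fd;
}
```

A real implementation would also run the "sealer" (xattr commit) against the victim before closing it, which is where the two queues David mentions would merge.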
David
* Re: Metadata writtenback notification? -- was Re: fscache: Redesigning the on-disk cache
2021-03-08 11:28 ` Metadata writtenback notification? -- was " David Howells
@ 2021-03-08 22:32 ` Dave Chinner
2021-03-08 23:20 ` Matthew Wilcox
2021-03-09 11:27 ` David Howells
1 sibling, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2021-03-08 22:32 UTC (permalink / raw)
To: David Howells
Cc: Amir Goldstein, linux-cachefs, Jeff Layton, David Wysochanski,
Matthew Wilcox (Oracle),
J. Bruce Fields, Christoph Hellwig, Dave Chinner, Alexander Viro,
linux-afs, Linux NFS Mailing List, CIFS, ceph-devel,
v9fs-developer, linux-fsdevel, linux-kernel, Miklos Szeredi
On Mon, Mar 08, 2021 at 11:28:41AM +0000, David Howells wrote:
> Amir Goldstein <amir73il@gmail.com> wrote:
>
> > > But after I've written and sync'd the data, I set the xattr to mark the
> > > file not open. At the moment I'm doing this too lazily, only doing it
> > > when a netfs file gets evicted or when the cache gets withdrawn, but I
> > > really need to add a queue of objects to be sealed as they're closed. The
> > > balance is working out how often to do the sealing as something like a
> > > shell script can do a lot of consecutive open/write/close ops.
> >
> > You could add an internal vfs API wait_for_multiple_inodes_to_be_synced().
> > For example, xfs keeps the "LSN" on each inode, so once the transaction
> > with some LSN has been committed, all the relevant inodes, if not dirty, can
> > be declared as synced, without having to call fsync() on any file and without
> > having to force transaction commit or any IO at all.
> >
> > Since fscache takes care of submitting the IO, and it shouldn't care about any
> > specific time that the data/metadata hits the disk(?), you can make use of the
> > existing periodic writeback and rolling transaction commit and only ever need
> > to wait for that to happen before marking cache files "closed".
> >
> > There was a discussion about fsyncing a range of files on LSFMM [1].
> > In the last comment on the article dchinner argues why we already have that
> > API (and now also with io_uring()), but AFAIK, we do not have a useful
> > wait_for_sync() API. And it doesn't need to be exposed to userspace at all.
> >
> > [1] https://lwn.net/Articles/789024/
>
> This sounds like an interesting idea. Actually, what I probably want is a
> notification to say that a particular object has been completely sync'd to
> disk, metadata and all.
This isn't hard to do yourself in the kernel. All it takes is a
workqueue to run vfs_fsync() calls asynchronously and for the work
to queue a local notification/wakeup when the fsync completes...
That's all aio_fsync() does - the notification it queues on
completion is the AIO completion event for userspace - so I think
you could do this in about 50 lines of code if you really needed
it...
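A userspace rendering of that 50-lines-of-code claim, with a pthread standing in for the workqueue worker and a condvar as the completion notification (all names here are invented for the sketch):

```c
/* Sketch: hand an fsync off to a worker thread and get a completion
 * notification, instead of blocking the caller.  In the kernel this would
 * be a work item on an unbound workqueue signalling a waitqueue; here a
 * pthread and a condition variable play those parts. */
#include <pthread.h>
#include <unistd.h>

struct fsync_work {
	int		fd;
	int		done;	/* completion flag */
	int		ret;	/* fsync() result */
	pthread_mutex_t	lock;
	pthread_cond_t	cond;
	pthread_t	thread;
};

static void *fsync_worker(void *arg)
{
	struct fsync_work *w = arg;
	int ret = fsync(w->fd);		/* the slow, synchronous part */

	pthread_mutex_lock(&w->lock);
	w->ret = ret;
	w->done = 1;			/* "metadata written back" event */
	pthread_cond_signal(&w->cond);
	pthread_mutex_unlock(&w->lock);
	return NULL;
}

/* Queue the fsync; returns 0 on successful dispatch. */
int fsync_async(struct fsync_work *w, int fd)
{
	w->fd = fd;
	w->done = 0;
	pthread_mutex_init(&w->lock, NULL);
	pthread_cond_init(&w->cond, NULL);
	return pthread_create(&w->thread, NULL, fsync_worker, w);
}

/* Wait for the completion notification; returns the fsync() result. */
int fsync_wait(struct fsync_work *w)
{
	pthread_mutex_lock(&w->lock);
	while (!w->done)
		pthread_cond_wait(&w->cond, &w->lock);
	pthread_mutex_unlock(&w->lock);
	pthread_join(w->thread, NULL);
	pthread_mutex_destroy(&w->lock);
	pthread_cond_destroy(&w->cond);
	return w->ret;
}
```

As David notes below, this actively forces the flush; what he'd prefer is a passive notification when writeback happens anyway, which this pattern can't provide.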
> However, some performance problems are arising in my fscache-iter
> branch:
>
> (1) It's doing a lot of synchronous metadata operations (tmpfile, truncate,
> setxattr).
Async pipelines using unbound workqueues are your friend.
>
> (2) It's retaining a lot of open file structs on cache files. Cachefiles
> opens the file when it's first asked to access it and retains that till
> the cookie is relinquished or the cache withdrawn (the file* doesn't
> contribute to ENFILE/EMFILE but it still eats memory).
Sounds similar to the problem that the NFSd open file cache solves.
(fs/nfsd/filecache.c)
> (3) Trimming excess data from the end of the cache file. The problem with
> using DIO to write to the cache is that the write has to be rounded up to
> a multiple of the backing fs DIO blocksize,
Actually, a multiple of the logical sector size of the backing
device behind the backing filesystem.
> but if the file is truncated
> larger, that excess data now becomes part of the file.
Keep the actual file size in your tracking xattr.
> Possibly it's sufficient to just clear the excess page space before
> writing, but that doesn't necessarily stop a writable mmap from
> scribbling on it.
We can't stop mmap from scribbling in it. All filesystems have this
problem, so to prevent data leaks we have to zero the post-eof tail
region on every write of the EOF block, anyway.
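The tail-zeroing Dave describes, in isolation: a DIO write must be a multiple of the sector size, so the slack between the real EOF and the rounded-up write length is cleared before it is allowed near the disk. A minimal sketch (helper names invented):

```c
/* Round a DIO write up to the device's logical sector size and zero the
 * post-EOF slack, so stale buffer contents can't leak into the cache file. */
#include <stddef.h>
#include <string.h>

/* Round a write length up to the sector size (must be a power of two). */
size_t dio_round_up(size_t len, size_t sector)
{
	return (len + sector - 1) & ~(sector - 1);
}

/* Zero the slack between the true data length and the padded DIO length. */
void zero_eof_tail(unsigned char *buf, size_t len, size_t padded)
{
	if (padded > len)
		memset(buf + len, 0, padded - len);
}
```

For a 1000-byte write on a 512-byte-sector device, the write goes out as 1024 bytes with bytes 1000..1023 zeroed.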
> (4) Committing outstanding cache metadata at cache withdrawal or netfs
> unmount. I've previously mentioned this: it ends up with a whole slew of
> synchronous metadata changes being committed to the cache in one go
> (truncates, fallocates, fsync, xattrs, unlink+link of tmpfile) - and this
> can take quite a long time. The cache needs to be more proactive in
> getting stuff committed as it goes along.
Workqueues give you an easy mechanism for async dispatch and
concurrency for synchronous operations. This is a largely solved
problem...
> (5) Attaching to an object requires a pathwalk to it (normally only two
> steps) and then reading various xattrs on it - all synchronous, but can
> be punted to a background threadpool.
a.k.a. punting to a workqueue :)
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Metadata writtenback notification? -- was Re: fscache: Redesigning the on-disk cache
2021-03-08 22:32 ` Dave Chinner
@ 2021-03-08 23:20 ` Matthew Wilcox
0 siblings, 0 replies; 13+ messages in thread
From: Matthew Wilcox @ 2021-03-08 23:20 UTC (permalink / raw)
To: Dave Chinner
Cc: David Howells, Amir Goldstein, linux-cachefs, Jeff Layton,
David Wysochanski, J. Bruce Fields, Christoph Hellwig,
Dave Chinner, Alexander Viro, linux-afs, Linux NFS Mailing List,
CIFS, ceph-devel, v9fs-developer, linux-fsdevel, linux-kernel,
Miklos Szeredi
On Tue, Mar 09, 2021 at 09:32:47AM +1100, Dave Chinner wrote:
> On Mon, Mar 08, 2021 at 11:28:41AM +0000, David Howells wrote:
> > Possibly it's sufficient to just clear the excess page space before
> > writing, but that doesn't necessarily stop a writable mmap from
> > scribbling on it.
>
> We can't stop mmap from scribbling in it. All filesystems have this
> problem, so to prevent data leaks we have to zero the post-eof tail
> region on every write of the EOF block, anyway.
That's certainly one approach. Another would be to zero it during the I/O
completion handler. It depends whether you can trust the last writer or
not (e.g. what do we do with an isofs file that happens to contain garbage
after the last byte in the file?)
* Re: Metadata writtenback notification? -- was Re: fscache: Redesigning the on-disk cache
2021-03-08 11:28 ` Metadata writtenback notification? -- was " David Howells
2021-03-08 22:32 ` Dave Chinner
@ 2021-03-09 11:27 ` David Howells
1 sibling, 0 replies; 13+ messages in thread
From: David Howells @ 2021-03-09 11:27 UTC (permalink / raw)
To: Dave Chinner
Cc: dhowells, Amir Goldstein, linux-cachefs, Jeff Layton,
David Wysochanski, Matthew Wilcox (Oracle),
J. Bruce Fields, Christoph Hellwig, Dave Chinner, Alexander Viro,
linux-afs, Linux NFS Mailing List, CIFS, ceph-devel,
v9fs-developer, linux-fsdevel, linux-kernel, Miklos Szeredi
Dave Chinner <david@fromorbit.com> wrote:
> > > There was a discussion about fsyncing a range of files on LSFMM [1].
> > > In the last comment on the article dchinner argues why we already have that
> > > API (and now also with io_uring()), but AFAIK, we do not have a useful
> > > wait_for_sync() API. And it doesn't need to be exposed to userspace at all.
> > >
> > > [1] https://lwn.net/Articles/789024/
> >
> > This sounds like an interesting idea. Actually, what I probably want is a
> > notification to say that a particular object has been completely sync'd to
> > disk, metadata and all.
>
> This isn't hard to do yourself in the kernel. All it takes is a
> workqueue to run vfs_fsync() calls asynchronously and for the work
> to queue a local notification/wakeup when the fsync completes...
>
> That's all aio_fsync() does - the notification it queues on
> completion is the AIO completion event for userspace - so I think
> you could do this in about 50 lines of code if you really needed
> it...
I was thinking more in terms of passively finding out when metadata has been
flushed to disk rather than actively forcing it. Obviously I can manually
flush from a worker thread, but that ties up a thread per file I want to
flush (unless I want to do a higher-level sync).
Btw, looking at aio_fsync(), is there any reason it copies the current creds
rather than just taking a ref on them? (Granted, this may not be a question
for you)
> > However, some performance problems are arising in my fscache-iter
> > branch:
> >
> > (1) It's doing a lot of synchronous metadata operations (tmpfile, truncate,
> > setxattr).
>
> Async pipelines using unbound workqueues are your friend.
Maybe. I could just throw everything into a workqueue and let the workqueue
deal with it. There still have to be synchronisation points, though - I can't
schedule a cache-write from a server-read to the cache following a 3rd-party
induced invalidation until after the invalidation has happened - and that
holds up userspace from writing to the cache. But maybe it will work.
Btw, how expensive is it to throw an operation off to a workqueue versus doing
it in thread? Particularly if it's a synchronous op that the thread is going
to have to wait for (e.g. write_begin()).
> > (2) It's retaining a lot of open file structs on cache files. Cachefiles
> > opens the file when it's first asked to access it and retains that till
> > the cookie is relinquished or the cache withdrawn (the file* doesn't
> > contribute to ENFILE/EMFILE but it still eats memory).
>
> Sounds similar to the problem that the NFSd open file cache solves.
> (fs/nfsd/filecache.c)
Looks similarish to what I was thinking of with having a queue of
currently-not-in-use cookies to go through and commit and close.
> > but if the file is truncated
> > larger, that excess data now becomes part of the file.
>
> Keep the actual file size in your tracking xattr.
I do that, but it doesn't help entirely. If someone truncates the file larger
and then writes non-contiguously, the problem occurs.
I've tried truncating the file down and then truncating it up, but that
requires two synchronous ops - though the latter is relatively cheap. I've
also tried fallocate() to clear the block. What I've found is that the next
DIO write then has to sync because these may read data into the pagecache of
the backing file.
Apart from clearing the tail of a page on writing, it might be better for me
to read the data into a spare page, clear the tail and write it back.
> > Possibly it's sufficient to just clear the excess page space before
> > writing, but that doesn't necessarily stop a writable mmap from
> > scribbling on it.
>
> We can't stop mmap from scribbling in it. All filesystems have this
> problem, so to prevent data leaks we have to zero the post-eof tail
> region on every write of the EOF block, anyway.
I meant an mmap scribbling on it after it's been cleared - but I guess taking
away the PTE-writeable flag and making page_mkwrite() wait should solve that.
> > (4) Committing outstanding cache metadata at cache withdrawal or netfs
> > unmount. I've previously mentioned this: it ends up with a whole
> > slew of synchronous metadata changes being committed to the cache in
> > one go (truncates, fallocates, fsync, xattrs, unlink+link of tmpfile)
> > - and this can take quite a long time. The cache needs to be more
> > proactive in getting stuff committed as it goes along.
>
> Workqueues give you an easy mechanism for async dispatch and
> concurrency for synchronous operations. This is a largely solved
> problem...
Yes and no. Yes, I can fan out the number of threads doing the committing,
but there's still a limit on the I/O bandwidth - and a lot of the operations
still have to hit the disk in the right order. It still stuffs up the user
experience if the cache eats up the entirety of the disk I/O for a few seconds
just because an automount expired.
Probably the progressive committing approach is a better one so that there's
less to do at the end.
> > (5) Attaching to an object requires a pathwalk to it (normally only two
> > steps) and then reading various xattrs on it - all synchronous, but can
> > be punted to a background threadpool.
>
> a.k.a. punting to a workqueue :)
I do that, but it doesn't help so much. Whilst it can mitigate the effect by
running parallel to userspace, userspace tends to move pretty quickly from
open() to read() - at which point we have to wait anyway.
The problem is that all the steps are synchronous and, for the most part, have
to be sequential because there's a dependency chain: 2 x dir-lookup, get LSM
xattrs, get cache xattrs - then read the data if it's present. I might be
able to speculate at the end and read two cache xattrs in parallel, but each
one requires a separate thread to do it.
On top of that, if the user is running a parallel application such as building
a kernel, a CPU running an offloaded I/O thread isn't running a user thread.
What I've found is that increasing the size of the threadpool doesn't actually
affect the time taken.
What I've done in my fscache-iter branch is to have a small thread pool and
offload work to it if there's a thread free - otherwise process the work in
the calling userspace thread and avoid the context switching.
One reason I was wondering about moving to an approach whereby I have an index
that locates all the blocks (which are then kept in a single file) is that I
can probably keep the entire index in RAM and so the lookup costs are vastly
reduced. The downside as Amir pointed out is that metadata coherency is much
harder if I don't just want to blow the cache away if cache isn't properly
committed when the machine is rebooted.
Note that OpenAFS has been using a single-index approach, with each 256K block
of data in its own file. They then zap any file that's newer than the index
file when the cache is started, assuming that that file might be corrupted.
David
* Re: fscache: Redesigning the on-disk cache
2021-03-08 9:13 ` David Howells
2021-03-08 10:35 ` Amir Goldstein
2021-03-08 11:28 ` Metadata writtenback notification? -- was " David Howells
@ 2021-03-08 18:54 ` J. Bruce Fields
2021-03-08 19:08 ` David Howells
` (2 subsequent siblings)
5 siblings, 0 replies; 13+ messages in thread
From: J. Bruce Fields @ 2021-03-08 18:54 UTC (permalink / raw)
To: David Howells
Cc: Amir Goldstein, linux-cachefs, Jeff Layton, David Wysochanski,
Matthew Wilcox (Oracle),
Christoph Hellwig, Dave Chinner, Alexander Viro, linux-afs,
Linux NFS Mailing List, CIFS, ceph-devel, v9fs-developer,
linux-fsdevel, linux-kernel, Miklos Szeredi
On Mon, Mar 08, 2021 at 09:13:55AM +0000, David Howells wrote:
> Amir Goldstein <amir73il@gmail.com> wrote:
> > With ->fiemap() you can at least make the distinction between a non existing
> > and an UNWRITTEN extent.
>
> I can't use that for XFS, Ext4 or btrfs, I suspect. Christoph and Dave's
> assertion is that the cache can't rely on the backing filesystem's metadata
> because these can arbitrarily insert or remove blocks of zeros to bridge or
> split extents.
Could you instead make some sort of explicit contract with the
filesystem? Maybe you'd flag it at mkfs time and query for it before
allowing a filesystem to be used for fscache. You don't need every
filesystem to support fscache, right, just one acceptable one?
--b.
* Re: fscache: Redesigning the on-disk cache
2021-03-08 9:13 ` David Howells
` (2 preceding siblings ...)
2021-03-08 18:54 ` J. Bruce Fields
@ 2021-03-08 19:08 ` David Howells
2021-03-08 21:55 ` Dave Chinner
2021-03-09 9:21 ` David Howells
5 siblings, 0 replies; 13+ messages in thread
From: David Howells @ 2021-03-08 19:08 UTC (permalink / raw)
To: J. Bruce Fields
Cc: dhowells, Amir Goldstein, linux-cachefs, Jeff Layton,
David Wysochanski, Matthew Wilcox (Oracle),
Christoph Hellwig, Dave Chinner, Alexander Viro, linux-afs,
Linux NFS Mailing List, CIFS, ceph-devel, v9fs-developer,
linux-fsdevel, linux-kernel, Miklos Szeredi
J. Bruce Fields <bfields@fieldses.org> wrote:
> On Mon, Mar 08, 2021 at 09:13:55AM +0000, David Howells wrote:
> > Amir Goldstein <amir73il@gmail.com> wrote:
> > > With ->fiemap() you can at least make the distinction between a non existing
> > > and an UNWRITTEN extent.
> >
> > I can't use that for XFS, Ext4 or btrfs, I suspect. Christoph and Dave's
> > assertion is that the cache can't rely on the backing filesystem's metadata
> > because these can arbitrarily insert or remove blocks of zeros to bridge or
> > split extents.
>
> Could you instead make some sort of explicit contract with the
> filesystem? Maybe you'd flag it at mkfs time and query for it before
> allowing a filesystem to be used for fscache. You don't need every
> filesystem to support fscache, right, just one acceptable one?
I've asked about that, but the filesystem maintainers are reluctant to do
that.
Something might be possible in ext4 using direct access to jbd2, though I
don't know exactly what facilities that offers.
David
* Re: fscache: Redesigning the on-disk cache
2021-03-08 9:13 ` David Howells
` (3 preceding siblings ...)
2021-03-08 19:08 ` David Howells
@ 2021-03-08 21:55 ` Dave Chinner
2021-03-09 9:21 ` David Howells
5 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2021-03-08 21:55 UTC (permalink / raw)
To: David Howells
Cc: Amir Goldstein, linux-cachefs, Jeff Layton, David Wysochanski,
Matthew Wilcox (Oracle),
J. Bruce Fields, Christoph Hellwig, Dave Chinner, Alexander Viro,
linux-afs, Linux NFS Mailing List, CIFS, ceph-devel,
v9fs-developer, linux-fsdevel, linux-kernel, Miklos Szeredi
On Mon, Mar 08, 2021 at 09:13:55AM +0000, David Howells wrote:
> Amir Goldstein <amir73il@gmail.com> wrote:
>
> > > (0a) As (0) but using SEEK_DATA/SEEK_HOLE instead of bmap and opening the
> > > file for every whole operation (which may combine reads and writes).
> >
> > I read that NFSv4 supports hole punching, so when using ->bmap() or SEEK_DATA
> > to keep track of present data, it's hard to distinguish between an
> > invalid cached range and a valid "cached hole".
>
> I wasn't exactly intending to permit caching over NFS. That leads to fun
> making sure that the superblock you're caching isn't the one that has the
> cache in it.
>
> However, we will need to handle hole-punching being done on a cached netfs,
> even if that's just to completely invalidate the cache for that file.
>
> > With ->fiemap() you can at least make the distinction between a non existing
> > and an UNWRITTEN extent.
>
> I can't use that for XFS, Ext4 or btrfs, I suspect. Christoph and Dave's
> assertion is that the cache can't rely on the backing filesystem's metadata
> because these can arbitrarily insert or remove blocks of zeros to bridge or
> split extents.
Well, that's not the big problem. The issue that makes FIEMAP
unusable for determining if there is user data present in a file is
that on-disk extent maps aren't exactly coherent with in-memory user
data state.
That is, we can have a hole on disk with delalloc user data in
memory. There's user data in the file, just not on disk. Same goes
for unwritten extents - there can be dirty data in memory over an
unwritten extent, and it won't get converted to written until the
data is written back and the filesystem runs a conversion
transaction.
So, yeah, if you use FIEMAP to determine where data lies in a file
that is being actively modified, you're going to get corrupt data
sooner rather than later. SEEK_HOLE/DATA are coherent with in
memory user data, so don't have this problem.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: fscache: Redesigning the on-disk cache
2021-03-08 9:13 ` David Howells
` (4 preceding siblings ...)
2021-03-08 21:55 ` Dave Chinner
@ 2021-03-09 9:21 ` David Howells
5 siblings, 0 replies; 13+ messages in thread
From: David Howells @ 2021-03-09 9:21 UTC (permalink / raw)
To: Dave Chinner
Cc: dhowells, Amir Goldstein, linux-cachefs, Jeff Layton,
David Wysochanski, Matthew Wilcox (Oracle),
J. Bruce Fields, Christoph Hellwig, Dave Chinner, Alexander Viro,
linux-afs, Linux NFS Mailing List, CIFS, ceph-devel,
v9fs-developer, linux-fsdevel, linux-kernel, Miklos Szeredi
Dave Chinner <david@fromorbit.com> wrote:
> > > With ->fiemap() you can at least make the distinction between a non
> > > existing and an UNWRITTEN extent.
> >
> > I can't use that for XFS, Ext4 or btrfs, I suspect. Christoph and Dave's
> > assertion is that the cache can't rely on the backing filesystem's metadata
> > because these can arbitrarily insert or remove blocks of zeros to bridge or
> > split extents.
>
> Well, that's not the big problem. The issue that makes FIEMAP
> unusable for determining if there is user data present in a file is
> that on-disk extent maps aren't exactly coherent with in-memory user
> data state.
>
> That is, we can have a hole on disk with delalloc user data in
> memory. There's user data in the file, just not on disk. Same goes
> for unwritten extents - there can be dirty data in memory over an
> unwritten extent, and it won't get converted to written until the
> data is written back and the filesystem runs a conversion
> transaction.
>
> So, yeah, if you use FIEMAP to determine where data lies in a file
> that is being actively modified, you're going to get corrupt data
> sooner rather than later. SEEK_HOLE/DATA are coherent with in
> memory user data, so don't have this problem.
I thought you and/or Christoph said it *was* a problem to use the backing
filesystem's metadata to track presence of data in the cache because the
filesystem (or its tools) can arbitrarily insert blocks of zeros to
bridge/break up extents.
If that is the case, then that is a big problem, and SEEK_HOLE/DATA won't
suffice.
If it's not a problem - maybe if I can set a mark on a file to tell the
filesystem and tools not to do that - then that would obviate the need for me
to store my own maps.
David