* fscache: Redesigning the on-disk cache
From: David Howells @ 2021-03-03 23:20 UTC (permalink / raw)
  To: linux-cachefs
  Cc: dhowells, Jeff Layton, David Wysochanski, Matthew Wilcox (Oracle),
	J. Bruce Fields, Christoph Hellwig, Dave Chinner, Alexander Viro,
	linux-afs, linux-nfs, linux-cifs, ceph-devel, v9fs-developer,
	linux-fsdevel, linux-kernel

I'm looking at redesigning the on-disk cache format used by fscache's
cachefiles driver to try to reduce the number of synchronous metadata
operations done by the driver, to improve culling performance and to reduce
the number of file opens and the number of files held open.  I also need to
stop relying on the backing filesystem to track where I have data stored.

There are a number of options that I've considered:

 (0) The current format lays out a directory tree, with directories for each
     level of index (so, in AFS's terms, you've got an overall "afs" dir
     containing a dir for each cell; in each cell dir, there's a dir for each
     volume; and within that there's a file for each AFS vnode cached).  Extra
     levels of directory are also interposed to reduce the number of entries
     in a directory.

     - Pathwalk cost to open a cache file.
     - Netfs coherency data is in xattrs.
     - Invalidation done by truncate or unlink.
     - Uses backing filesystem metadata to keep track of present data.
       - Determined by bmap() on the cache file.
     - Culling performed by userspace daemon.
     - Data file opened for every write.
     - Read done by readpage without file.

 (0a) As (0) but using SEEK_DATA/SEEK_HOLE instead of bmap and opening the
      file for every whole operation (which may combine reads and writes).

 (1) Structured the same as (0), but keeping an independent content map and
     not relying on backing fs metadata.  Use a larger blocksize, say 256K, to
     reduce the size of the content map.

     - Netfs coherency data in xattrs.
     - Invalidation done by tmpfile creation and link-replace.
     - Content bitmap kept in xattr.
       - Limited capacity.  Could use multiple bitmaps.
       - Can skip the bitmap for a non-sparse file that we have all of.
     - "Open" state kept in xattr.
     - File is kept open.
     - Culling performed by userspace daemon.
     - Cache file open whilst netfs file is open.

 (2) Structured the same as (1), but keeping an extent list instead of a
     bitmap.

     - Content extent map kept in xattr.
       - Limited capacity.
       - Highly scattered extents use a lot of map space.

 (3) OpenAFS-style format.  One index file to look up {file_key,block#} and an
     array of data files, each holding one block (e.g. a 256KiB-aligned chunk
     of a file).  Each index entry has valid start/end offsets for easy
     truncation.

     The index has a hash to facilitate the lookup and an LRU that allows a
     block to be recycled at any time.

     - File keys are highly variable in length and can be rather long,
       particularly NFS FIDs.
       - Might want a separate file index that maps file keys to a slot ID
         that can then be used in the block index.
     - Netfs coherency data in vnode index entry.
     - Invalidation done by clearing matching entries in the index.
       - Dead data files can be lazily unlinked or truncated or just
         overwritten.
     - Content mapping by lookup in block index hash table.
       - Fine if the hash table is large and scatter is good.
     - Potential coherency problem between indices and data file.
     - Culling performed by block index LRU.
     - Really want to retain entire block index in RAM.
     - Data files are opened for every read/write.

 (4) Similar format to (3), but could put entirety of data in one file.

     - Data file open entire time cache online.
     - Unused block bitmap.
     - Can use fallocate to punch out dead blocks.
     - Could put data file on blockdev.

 (5) Similar idea to (4), but just have a file index and use block pointers
     and indirection blocks instead.  Use an LRU in the file index and cull
     whole files only, not individual blocks.

     - File keys are highly variable in length and can be rather long,
       particularly NFS FIDs.
     - Netfs coherency data in vnode index entry.
     - Unused data block bitmap.
     - Invalidation done by clearing entries in the file index.
       - Data blocks must be recycled and returned to bitmap.
       - Dead data blocks can be lazily punched out with fallocate.
     - Potential coherency problem between index, pointers/indirection and
       bitmap.
     - Culling performed by file index LRU.
     - Really want to retain entire file index and block bitmap in RAM.
       - May be less memory than block index.
     - Data file open entire time cache online.
     - Could put data file on blockdev.
     - If the block size is large, lots of dead space in indirection blocks.

 (6) Similar to (5), but use extent lists rather than indirection blocks.

     - Requires allocation of contiguous space to be worthwhile.
     - Buddy allocator approach?
       - Can always arbitrarily recycle buddies to make larger spaces - if we
         can find them...

 (7) Hybrid approach.  Stick the first block of every netfs file in one big
     cache file.  For a lot of cases, that would suffice for the entire file
     if the block size is large enough.  Store the tails of larger files in
     separate files.

     - File index with LRU.
     - More complicated to manage.
     - Fewer files open.

So (0) is what's upstream.  I have (0a) implemented in my fscache-netfs-lib
branch and (1) implemented in my fscache-iter branch.  However, the latter
spends a lot of CPU time doing synchronous metadata ops, such as creating
tmpfiles, creating links and setting xattrs, particularly when unmounting the
filesystem or disabling the cache - both of which are done during shutdown.

I'm leaning towards (4) or (5).  I could use extent maps, but I won't
necessarily have a good idea of what access patterns I have to deal with until
later.  With network filesystems that are going to read and cache large blocks
(say 2MiB), extents would allow a reduction in the metadata, particularly
where a single cached range would otherwise span many bits of a bitmap.

Using a block index (4) allows me to easily recycle a large chunk of cache in
one go - even if it means arbitrarily kicking out blocks that weren't near the
end of the LRU yet.

Using block pointers and indirection blocks (5) means I only need this data in
RAM when I actually need it, with the LRU management being done in the file
index.

Either way with (4) and (5), at least one index really needs to be resident in
RAM to make LRU wangling efficient.  Also, I need to decide how to handle
coherency management - setting an "in use" flag on the file index entry and
flushing it before making any modifications might work.

On the other hand, sticking with (1) or (2) makes it easy to add extra
metadata (say, to handle disconnected operation), though it's harder to manage
culling and the capacity of the cache.

I have a suspicion that the answer is "it depends" and that the best choice is
very much going to be workload dependent - and may even vary from volume to
volume within the same workload.

Any thoughts or better solutions?

David



* fscache: Redesigning the on-disk cache - LRU handling
From: David Howells @ 2021-03-04 13:47 UTC (permalink / raw)
  To: linux-cachefs
  Cc: dhowells, Jeff Layton, David Wysochanski, Matthew Wilcox (Oracle),
	J. Bruce Fields, Christoph Hellwig, Dave Chinner, Alexander Viro,
	linux-afs, linux-nfs, linux-cifs, ceph-devel, v9fs-developer,
	linux-fsdevel, linux-kernel

David Howells <dhowells@redhat.com> wrote:

> 
>  (3) OpenAFS-style format.  One index file to look up {file_key,block#} and an
>      array of data files, each holding one block (e.g. a 256KiB-aligned chunk
>      of a file).  Each index entry has valid start/end offsets for easy
>      truncation.
> 
>      The index has a hash to facilitate the lookup and an LRU that allows a
>      block to be recycled at any time.

The LRU would probably have to be a doubly-linked list so that entries can be
removed from it easily.  This means typically touching two other entries,
which might not be in the same page; further, if the entry is being freed,
we'd need to excise it from the hash chain also, necessitating touching maybe
two more entries - which might also be in different pages.

Maybe the LRU idea plus a free block bitmap could be combined, however.

 (1) Say that there's a bit-pair map, with one bit pair per block.  The pair
     is set to 0 when the block is free.  When the block is accessed, the pair
     is set to 3.

 (2) When we run out of free blocks (ie. pairs that are zero), we decrement
     all the pairs and then look again.

 (3) Excision from the old hash chain would need to be done at allocation,
     though it does give a block whose usage has been reduced to 0 the chance
     to be resurrected.

Possible variations on the theme could be:

 (*) Set the pair to 2, not 3 when accessed.  Set the block to 3 to pin it;
     the process of decrementing all the pairs would leave it at 3.

 (*) Rather than decrementing all pairs at once, have a rotating window that
     does a part of the map at once.

 (*) If a round of decrementing doesn't reduce any pairs to zero, reject a
     request for space.

This would also work for a file index.
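
As a concrete illustration of the variant where an access sets the pair to 2
and 3 means "pinned", here's a small userspace sketch (block count and names
are invented; it only shows the allocate/age cycle, nothing more):

/* 2 bits of usage state per block: 0 = free, 1-2 = in use/ageing, 3 = pinned. */
#include <stdint.h>
#include <stdio.h>

#define NR_BLOCKS 16
static uint8_t map[(NR_BLOCKS + 3) / 4];        /* the bit-pair map */

static unsigned int get_pair(unsigned int b)
{
        return (map[b / 4] >> ((b % 4) * 2)) & 3;
}

static void set_pair(unsigned int b, unsigned int v)
{
        map[b / 4] &= ~(3 << ((b % 4) * 2));
        map[b / 4] |= v << ((b % 4) * 2);
}

/* One round of decrementing; pinned (3) pairs are left alone. */
static void age_all(void)
{
        for (unsigned int b = 0; b < NR_BLOCKS; b++) {
                unsigned int v = get_pair(b);

                if (v > 0 && v < 3)
                        set_pair(b, v - 1);
        }
}

/* Find a free pair, ageing once if there are none; -1 means reject. */
static int alloc_block(void)
{
        for (int pass = 0; pass < 2; pass++) {
                for (unsigned int b = 0; b < NR_BLOCKS; b++) {
                        if (get_pair(b) == 0) {
                                set_pair(b, 2); /* accessed */
                                return b;       /* excise from old hash here */
                        }
                }
                age_all();
        }
        return -1;
}

int main(void)
{
        for (int i = 0; i < NR_BLOCKS + 4; i++)
                printf("alloc -> %d\n", alloc_block());
        return 0;
}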

David



* Re: fscache: Redesigning the on-disk cache
From: Amir Goldstein @ 2021-03-05  9:46 UTC (permalink / raw)
  To: David Howells
  Cc: linux-cachefs, Jeff Layton, David Wysochanski,
	Matthew Wilcox (Oracle),
	J. Bruce Fields, Christoph Hellwig, Dave Chinner, Alexander Viro,
	linux-afs, Linux NFS Mailing List, CIFS, ceph-devel,
	v9fs-developer, linux-fsdevel, linux-kernel, Miklos Szeredi

On Thu, Mar 4, 2021 at 4:10 PM David Howells <dhowells@redhat.com> wrote:
>
> I'm looking at redesigning the on-disk cache format used by fscache's
> cachefiles driver to try and eliminate the number of synchronous metadata
> operations done by the driver, to improve culling performance and to reduce
> the amount of opens/files open.  I also need to stop relying on the backing
> filesystem to track where I have data stored.
>
> There are a number of options that I've considered:
>
>  (0) The current format lays out a directory tree, with directories for each
>      level of index (so in AFS's terms, you've got an overall "afs" dir
>      containing a dir for each cell.  In each cell dir, there's a dir for each
>      volume and within that there's a file for each afs vnode cached.  Extra
>      levels of directory are also interposed to reduce the number of entries
>      in a directory.
>
>      - Pathwalk cost to open a cache file.
>      - Netfs coherency data is in xattrs.
>      - Invalidation done by truncate or unlink.
>      - Uses backing filesystem metadata to keep track of present data.
>        - Determined by bmap() on the cache file.
>      - Culling performed by userspace daemon.
>      - Data file opened for every write.
>      - Read done by readpage without file.
>
>  (0a) As (0) but using SEEK_DATA/SEEK_HOLE instead of bmap and opening the
>       file for every whole operation (which may combine reads and writes).

I read that NFSv4 supports hole punching, so when using ->bmap() or SEEK_DATA
to keep track of present data, it's hard to distinguish between an invalid
cached range and a valid "cached hole".

With ->fiemap() you can at least make the distinction between a non-existent
extent and an UNWRITTEN extent.
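
To illustrate the ambiguity being described, here's a rough userspace sketch
(the path and block size are invented, and it assumes a backing fs that
supports hole punching and reports holes via SEEK_DATA/SEEK_HOLE):

/* Sketch: why SEEK_DATA/SEEK_HOLE can't tell "never cached" apart from
 * "was cached, but a hole was punched over it".
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void probe(int fd, off_t off)
{
        off_t d = lseek(fd, off, SEEK_DATA);

        printf("offset %lld: %s\n", (long long)off,
               d == off ? "data present" : "hole (absent or punched?)");
}

int main(void)
{
        char block[65536];
        int fd = open("/var/tmp/cachefile", O_RDWR | O_CREAT | O_TRUNC, 0600);

        if (fd < 0)
                return 1;
        memset(block, 0xaa, sizeof(block));

        /* "Cache" blocks 0 and 2, leaving block 1 sparse. */
        pwrite(fd, block, sizeof(block), 0);
        pwrite(fd, block, sizeof(block), 2 * sizeof(block));

        /* A server-side hole punch propagated into the cache file. */
        fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  2 * sizeof(block), sizeof(block));

        probe(fd, 0);                   /* data */
        probe(fd, sizeof(block));       /* hole - never cached */
        probe(fd, 2 * sizeof(block));   /* hole - cached, then punched */
        close(fd);
        return 0;
}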

>
>  (1) Structured the same as (0), but keeping an independent content map and
>      not relying on backing fs metadata.  Use a larger blocksize, say 256K, to
>      reduce the size of the content map.
>
>      - Netfs coherency data in xattrs.
>      - Invalidation done by tmpfile creation and link-replace.
>      - Content bitmap kept in xattr.
>        - Limited capacity.  Could use multiple bitmaps.
>        - Can skip the bitmap for a non-sparse file that we have all of.
>      - "Open" state kept in xattr.
>      - File is kept open
>      - Culling performed by userspace daemon.
>      - Cache file open whilst netfs file is open.
>
>  (2) Structured the same as (1), but keeping an extent list instead of a
>      bitmap.
>
>      - Content extent map kept in xattr.
>        - Limited capacity.
>        - Highly scattered extents use a lot of map space.
>
>  (3) OpenAFS-style format.  One index file to look up {file_key,block#} and an
>      array of data files, each holding one block (e.g. a 256KiB-aligned chunk
>      of a file).  Each index entry has valid start/end offsets for easy
>      truncation.
>
>      The index has a hash to facilitate the lookup and an LRU that allows a
>      block to be recycled at any time.
>
>      - File keys, are highly variable in length and can be rather long,
>        particularly NFS FIDs.
>        - Might want a separate file index that maps file keys to a slot ID
>          that can then be used in the block index.
>      - Netfs coherency data in vnode index entry.
>      - Invalidation done by clearing matching entries in the index.
>        - Dead data files can be lazily unlinked or truncated or just
>          overwritten.
>      - Content mapping by lookup in block index hash table.
>        - Fine if the hash table is large and scatter is good.
>      - Potential coherency problem between indices and data file.
>      - Culling performed by block index LRU.
>      - Really want to retain entire block index in RAM.
>      - Data files are opened for every read/write.
>
>  (4) Similar format to (3), but could put entirety of data in one file.
>
>      - Data file open entire time cache online.
>      - Unused block bitmap.
>      - Can use fallocate to punch out dead blocks.
>      - Could put data file on blockdev.
>
>  (5) Similar idea to (4), but just have a file index and use block pointers
>      and indirection blocks instead.  Use an LRU in the file index and cull
>      whole files only, not individual blocks.
>
>      - File keys, are highly variable in length and can be rather long,
>        particularly NFS FIDs.
>      - Netfs coherency data in vnode index entry.
>      - Unused data block bitmap.
>      - Invalidation done by clearing entries in the file index.
>        - Data blocks must be recycled and returned to bitmap.
>        - Dead data blocks can be lazily punched out with fallocate.
>      - Potential coherency problem between index, pointers/indirection and
>        bitmap.
>      - Culling performed by file index LRU.
>      - Really want to retain entire file index and block bitmap in RAM.
>        - May be less memory than block index.
>      - Data file open entire time cache online.
>      - Could put data file on blockdev.
>      - If the block size is large, lots of dead space in indirection blocks.
>
>  (6) Similar to (5), but use extent lists rather than indirection blocks.
>
>      - Requires allocation of contiguous space to be worthwhile.
>      - Buddy allocator approach?
>        - Can always arbitrarily recycle buddies to make larger spaces - if we
>          can find them...
>
>  (7) Hybrid approach.  Stick the first block of every netfs file in one big
>      cache file.  For a lot of cases, that would suffice for the entire file
>      if the block size is large enough.  Store the tails of larger files in
>      separate files.
>
>      - File index with LRU.
>      - More complicated to manage.
>      - Fewer files open.
>
> So (0) is what's upstream.  I have (0a) implemented in my fscache-netfs-lib
> branch and (1) implemented in my fscache-iter branch.  However, it spends a
> lot of cpu time doing synchronous metadata ops, such as creating tmpfiles,
> link creation and setting xattrs, particularly when unmounting the filesystem
> or disabling the cache - both of which are done during shutdown.
>

You didn't say much about crash consistency or durability requirements of the
cache. Since cachefiles only syncs the cache on shutdown, I guess you
rely on the hosting filesystem to provide the required ordering guarantees.

How does this work with a write-through network fs cache if the client system
crashes but the write gets to the server?  The client system would then
restart with older cached data because the disk caches were not flushed before
the crash, correct?  Is that case handled?  Are the caches invalidated on
unclean shutdown?

Anyway, how are those ordering requirements going to be handled when the
entire index is in a file?  You'd practically need to re-implement a
filesystem journal, or only write cache updates to a temp file that can be
discarded at any time?

> I'm leaning towards (4) or (5).  I could use extent maps, but I don't
> necessarily have a good idea of what access patterns I have to deal with till
> later.  With network filesystems that are going to read and cache large blocks
> (say 2MiB), extents would allow reduction of the metadata, particularly where
> it would span a bitmap.
>
> Using a block index (4) allows me to easily recycle a large chunk of cache in
> one go - even if it means arbitrarily kicking out blocks that weren't near the
> end of the LRU yet.
>
> Using block pointers and indirection blocks (5) means I only need this data in
> RAM when I need it; with the LRU management being done in the file index.
>
> Either way with (4) and (5), at least one index really needs to be resident in
> RAM to make LRU wangling efficient.  Also, I need to decide how to handle
> coherency management - setting an "in use" flag on the file index entry and
> flushing it before making any modifications might work.
>
> On the other hand, sticking with (1) or (2) makes it easier to add extra
> metadata very easily (say to handle disconnected operation), though it's
> harder to manage culling and manage the capacity of the cache.
>
> I have a suspicion that the answer is "it depends" and that the best choice is
> very much going to be workload dependent - and may even vary from volume to
> volume within the same workload.
>
> Any thoughts or better solutions?
>

If you come up with a useful generic implementation of a "file data overlay",
overlayfs could also use it for "partial copy up" as well as for implementation
of address space operations, so please keep that in mind.

Thanks,
Amir.


* Re: fscache: Redesigning the on-disk cache
From: David Howells @ 2021-03-08  9:13 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: dhowells, linux-cachefs, Jeff Layton, David Wysochanski,
	Matthew Wilcox (Oracle),
	J. Bruce Fields, Christoph Hellwig, Dave Chinner, Alexander Viro,
	linux-afs, Linux NFS Mailing List, CIFS, ceph-devel,
	v9fs-developer, linux-fsdevel, linux-kernel, Miklos Szeredi

Amir Goldstein <amir73il@gmail.com> wrote:

> >  (0a) As (0) but using SEEK_DATA/SEEK_HOLE instead of bmap and opening the
> >       file for every whole operation (which may combine reads and writes).
> 
> I read that NFSv4 supports hole punching, so when using ->bmap() or SEEK_DATA
> to keep track of present data, it's hard to distinguish between an
> invalid cached range and a valid "cached hole".

I wasn't exactly intending to permit caching over NFS.  That leads to fun
making sure that the superblock you're caching isn't the one that has the
cache in it.

However, we will need to handle hole-punching being done on a cached netfs,
even if that's just to completely invalidate the cache for that file.

> With ->fiemap() you can at least make the distinction between a non existing
> and an UNWRITTEN extent.

I can't use that for XFS, Ext4 or btrfs, I suspect.  Christoph and Dave's
assertion is that the cache can't rely on the backing filesystem's metadata
because these can arbitrarily insert or remove blocks of zeros to bridge or
split extents.

> You didn't say much about crash consistency or durability requirements of the
> cache. Since cachefiles only syncs the cache on shutdown, I guess you
> rely on the hosting filesystem to provide the required ordering guarantees.

There's an xattr on each file in the cache to record the state.  I use this
to mark a cache file "open".  If, when I look up a file, the file is marked
open, it is just discarded at the moment.

Now, there are two types of data stored in the cache: data that has to be
stored as a single complete blob and is replaced as such (e.g. symlinks and
AFS dirs) and data that might be randomly modified (e.g. regular files).

For the former, I have code, though in yet another branch, that writes this in
a tmpfile, sets the xattrs and then uses vfs_link(LINK_REPLACE) to cut over.

For the latter, that's harder to do as it would require copying the data to
the tmpfile before we're allowed to modify it.  However, if it's possible to
create a tmpfile that's a CoW version of a data file, I could go down that
route.
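
For reference, a userspace approximation of that whole-blob flow might look
like the sketch below.  The paths and xattr name are invented, and since the
link-replace cut-over is kernel-internal, the sketch links the tmpfile to a
temporary name and rename()s over the old cache file instead:

/* Write a replacement blob into an unlinked tmpfile, attach the state
 * xattr, then atomically cut over to it.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/xattr.h>
#include <unistd.h>

int main(void)
{
        const char *dir = "/var/cache/fscache/objects";
        const char blob[] = "replacement contents of the cached symlink/dir";
        const char coherency[] = "data-version=42";
        char proc[64], newname[256];
        int fd;

        /* 1. Write the whole blob into an unlinked tmpfile. */
        fd = open(dir, O_TMPFILE | O_WRONLY, 0600);
        if (fd < 0)
                return 1;
        if (pwrite(fd, blob, sizeof(blob), 0) != (ssize_t)sizeof(blob))
                return 1;

        /* 2. Attach the coherency/state xattr before the file is visible. */
        if (fsetxattr(fd, "user.cachefiles.state", coherency,
                      sizeof(coherency), 0) < 0)
                return 1;
        if (fsync(fd) < 0)
                return 1;

        /* 3. Give the tmpfile a name, then atomically replace the old file. */
        snprintf(proc, sizeof(proc), "/proc/self/fd/%d", fd);
        snprintf(newname, sizeof(newname), "%s/obj1234.new", dir);
        if (linkat(AT_FDCWD, proc, AT_FDCWD, newname, AT_SYMLINK_FOLLOW) < 0)
                return 1;
        if (rename(newname, "/var/cache/fscache/objects/obj1234") < 0)
                return 1;
        close(fd);
        return 0;
}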

But after I've written and sync'd the data, I set the xattr to mark the file
not open.  At the moment I'm doing this too lazily, only doing it when a netfs
file gets evicted or when the cache gets withdrawn, but I really need to add a
queue of objects to be sealed as they're closed.  The balance is working out
how often to do the sealing as something like a shell script can do a lot of
consecutive open/write/close ops.

> How does this work with write through network fs cache if the client system
> crashes but the write gets to the server?

The presumption is that the coherency info on the server will change, but
won't get updated in the cache.

> Client system get restart with older cached data because disk caches were
> not flushed before crash. Correct?  Is that case handled? Are the caches
> invalidated on unclean shutdown?

The netfs provides some coherency info for the cache to store.  For AFS, for
example, this is the data version number (though it should probably include
the volume creation time too).  This is stored with the state info in the same
xattr and is only updated when the "open" state is cleared.

When the cache file is reopened, if the coherency info doesn't match what
we're expecting (presumably we queried the server), the file is discarded.

(Note that the coherency info is netfs-specific)
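
A minimal sketch of that reopen-time check (the xattr name and state layout
are invented here, not the real cachefiles format):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/xattr.h>

struct cache_state {
        uint8_t open;                   /* non-zero => never cleanly sealed */
        uint8_t coherency[16];          /* netfs blob, e.g. AFS data version */
};

/* True if the cache file may be reused; false if it must be discarded
 * (it was still marked open, or the server-side coherency info changed).
 */
static bool cache_check(int fd, const void *expected, size_t len)
{
        struct cache_state st;

        if (fgetxattr(fd, "user.cachefiles.state", &st, sizeof(st)) !=
            (ssize_t)sizeof(st))
                return false;
        if (st.open)
                return false;
        if (len > sizeof(st.coherency) ||
            memcmp(st.coherency, expected, len) != 0)
                return false;
        return true;
}

int main(void)
{
        uint8_t expected[16] = { 42 };  /* what the server reports now */
        int fd = open("/var/cache/fscache/objects/obj1234", O_RDONLY);

        return (fd >= 0 && cache_check(fd, expected, sizeof(expected))) ? 0 : 1;
}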

> Anyway, how are those ordering requirements going to be handled when entire
> indexing is in a file? You'd practically need to re-implement a filesystem

Yes, the thought has occurred to me too.  I would be implementing a "simple"
filesystem - and we have lots of those :-/.  The most obvious solution would
be to use the backing filesystem's metadata - except that that's not possible.

> journal or only write cache updates to a temp file that can be discarded at
> any time?

It might involve keeping a bitmap of "open" blocks.  Those blocks get
invalidated when the cache restarts.  The simplest solution would be to wipe
the entire cache in such a situation, but that goes against one of the
important features I want out of it.

Actually, a journal of open and closed blocks might be better, though all I
really need to store for each block is a 32-bit number.
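
One possible encoding of such a record, purely to make the 32-bits-per-block
point concrete (the top-bit convention is an invented illustration, not a
proposed format):

#include <stdint.h>

enum cache_block_op { CACHE_BLOCK_CLOSED = 0, CACHE_BLOCK_OPEN = 1 };

/* Each journal record is a single 32-bit word appended to the journal file:
 * the block number in the low 31 bits, the open/closed state in the top bit.
 */
static inline uint32_t cache_jrec_encode(uint32_t block, enum cache_block_op op)
{
        return (block & 0x7fffffffu) | ((uint32_t)op << 31);
}

static inline uint32_t cache_jrec_block(uint32_t rec)
{
        return rec & 0x7fffffffu;
}

static inline int cache_jrec_is_open(uint32_t rec)
{
        return rec >> 31;
}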

It's a particular problem if I'm doing DIO to the data storage area but
buffering the changes to the metadata.  Further, the metadata and data might
be on different media, just to add to the complexity.

Another possibility is only to cull blocks when the parent file is culled.
That probably makes more sense: as long as the file is registered as culled on
disk first and I don't reuse the file slot too quickly, I can write to the
data store before updating the metadata.

> If you come up with a useful generic implementation of a "file data
> overlay", overlayfs could also use it for "partial copy up" as well as for
> implementation of address space operations, so please keep that in mind.

I'm trying to implement things so that the netfs does look-aside when reading,
and multi-destination write-back when writing - but the netfs is in the
driving seat and the cache is invisible to the user.  I really want to avoid
overlaying the cache on the netfs so that the cache is the primary access
point.

David



* Re: fscache: Redesigning the on-disk cache
From: Amir Goldstein @ 2021-03-08 10:35 UTC (permalink / raw)
  To: David Howells
  Cc: linux-cachefs, Jeff Layton, David Wysochanski,
	Matthew Wilcox (Oracle),
	J. Bruce Fields, Christoph Hellwig, Dave Chinner, Alexander Viro,
	linux-afs, Linux NFS Mailing List, CIFS, ceph-devel,
	v9fs-developer, linux-fsdevel, linux-kernel, Miklos Szeredi

On Mon, Mar 8, 2021 at 11:14 AM David Howells <dhowells@redhat.com> wrote:
>
> Amir Goldstein <amir73il@gmail.com> wrote:
>
> > >  (0a) As (0) but using SEEK_DATA/SEEK_HOLE instead of bmap and opening the
> > >       file for every whole operation (which may combine reads and writes).
> >
> > I read that NFSv4 supports hole punching, so when using ->bmap() or SEEK_DATA
> > to keep track of present data, it's hard to distinguish between an
> > invalid cached range and a valid "cached hole".
>
> I wasn't exactly intending to permit caching over NFS.  That leads to fun
> making sure that the superblock you're caching isn't the one that has the
> cache in it.
>
> However, we will need to handle hole-punching being done on a cached netfs,
> even if that's just to completely invalidate the cache for that file.
>
> > With ->fiemap() you can at least make the distinction between a non existing
> > and an UNWRITTEN extent.
>
> I can't use that for XFS, Ext4 or btrfs, I suspect.  Christoph and Dave's
> assertion is that the cache can't rely on the backing filesystem's metadata
> because these can arbitrarily insert or remove blocks of zeros to bridge or
> split extents.
>
> > You didn't say much about crash consistency or durability requirements of the
> > cache. Since cachefiles only syncs the cache on shutdown, I guess you
> > rely on the hosting filesystem to provide the required ordering guarantees.
>
> There's an xattr on each file in the cache to record the state.  I use this
> mark a cache file "open".  If, when I look up a file, the file is marked open,
> it is just discarded at the moment.
>
> Now, there are two types of data stored in the cache: data that has to be
> stored as a single complete blob and is replaced as such (e.g. symlinks and
> AFS dirs) and data that might be randomly modified (e.g. regular files).
>
> For the former, I have code, though in yet another branch, that writes this in
> a tmpfile, sets the xattrs and then uses vfs_link(LINK_REPLACE) to cut over.
>
> For the latter, that's harder to do as it would require copying the data to
> the tmpfile before we're allowed to modify it.  However, if it's possible to
> create a tmpfile that's a CoW version of a data file, I could go down that
> route.
>
> But after I've written and sync'd the data, I set the xattr to mark the file
> not open.  At the moment I'm doing this too lazily, only doing it when a netfs
> file gets evicted or when the cache gets withdrawn, but I really need to add a
> queue of objects to be sealed as they're closed.  The balance is working out
> how often to do the sealing as something like a shell script can do a lot of
> consecutive open/write/close ops.
>

You could add an internal vfs API wait_for_multiple_inodes_to_be_synced().
For example, xfs keeps the "LSN" on each inode, so once the transaction
with some LSN has been committed, all the relevant inodes, if not dirty, can
be declared as synced, without having to call fsync() on any file and without
having to force transaction commit or any IO at all.

Since fscache takes care of submitting the IO, and it shouldn't care about any
specific time that the data/metadata hits the disk(?), you can make use of the
existing periodic writeback and rolling transaction commit and only ever need
to wait for that to happen before marking cache files "closed".

There was a discussion about fsyncing a range of files at LSFMM [1].
In the last comment on the article dchinner argues that we already have that
API (and now also with io_uring()), but AFAIK we do not have a useful
wait_for_sync() API.  And it doesn't need to be exposed to userspace at all.

[1] https://lwn.net/Articles/789024/

> > Anyway, how are those ordering requirements going to be handled when entire
> > indexing is in a file? You'd practically need to re-implement a filesystem
>
> Yes, the though has occurred to me too.  I would be implementing a "simple"
> filesystem - and we have lots of those:-/.  The most obvious solution is to
> use the backing filesystem's metadata - except that that's not possible.
>
> > journal or only write cache updates to a temp file that can be discarded at
> > any time?
>
> It might involve keeping a bitmap of "open" blocks.  Those blocks get
> invalidated when the cache restarts.  The simplest solution would be to wipe
> the entire cache in such a situation, but that goes against one of the
> important features I want out of it.
>
> Actually, a journal of open and closed blocks might be better, though all I
> really need to store for each block is a 32-bit number.
>
> It's a particular problem if I'm doing DIO to the data storage area but
> buffering the changes to the metadata.  Further, the metadata and data might
> be on different media, just to add to the complexity.
>
> Another possibility is only to cull blocks when the parent file is culled.
> That probably makes more sense as, as long as the file is registered culled on
> disk first and I don't reuse the file slot too quickly, I can write to the
> data store before updating the metadata.
>

If I were you, I would try to avoid re-implementing a journaled filesystem or
a database for fscache and try to make use of crash consistency guarantees
that filesystems already provide.
Namely, use the data dependency already provided by temp files.
It doesn't need to be one temp file per cached file.

Always easier said than done ;-)

Thanks,
Amir.


* Metadata writtenback notification? -- was Re: fscache: Redesigning the on-disk cache
From: David Howells @ 2021-03-08 11:28 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: dhowells, linux-cachefs, Jeff Layton, David Wysochanski,
	Matthew Wilcox (Oracle),
	J. Bruce Fields, Christoph Hellwig, Dave Chinner, Alexander Viro,
	linux-afs, Linux NFS Mailing List, CIFS, ceph-devel,
	v9fs-developer, linux-fsdevel, linux-kernel, Miklos Szeredi

Amir Goldstein <amir73il@gmail.com> wrote:

> > But after I've written and sync'd the data, I set the xattr to mark the
> > file not open.  At the moment I'm doing this too lazily, only doing it
> > when a netfs file gets evicted or when the cache gets withdrawn, but I
> > really need to add a queue of objects to be sealed as they're closed.  The
> > balance is working out how often to do the sealing as something like a
> > shell script can do a lot of consecutive open/write/close ops.
> 
> You could add an internal vfs API wait_for_multiple_inodes_to_be_synced().
> For example, xfs keeps the "LSN" on each inode, so once the transaction
> with some LSN has been committed, all the relevant inodes, if not dirty, can
> be declared as synced, without having to call fsync() on any file and without
> having to force transaction commit or any IO at all.
> 
> Since fscache takes care of submitting the IO, and it shouldn't care about any
> specific time that the data/metadata hits the disk(?), you can make use of the
> existing periodic writeback and rolling transaction commit and only ever need
> to wait for that to happen before marking cache files "closed".
> 
> There was a discussion about fsyncing a range of files on LSFMM [1].
> In the last comment on the article dchinner argues why we already have that
> API (and now also with io_uring(), but AFAIK, we do not have a useful
> wait_for_sync() API. And it doesn't need to be exposed to userspace at all.
> 
> [1] https://lwn.net/Articles/789024/

This sounds like an interesting idea.  Actually, what I probably want is a
notification to say that a particular object has been completely sync'd to
disk, metadata and all.

I'm not sure that io_uring is particularly usable from within the kernel,
though.

> If I were you, I would try to avoid re-implementing a journaled filesystem or
> a database for fscache and try to make use of crash consistency guarantees
> that filesystems already provide.
> Namely, use the data dependency already provided by temp files.
> It doesn't need to be one temp file per cached file.
> 
> Always easier said than done ;-)

Yes.

There are a number of considerations I have to deal with, and they're somewhat
at odds with each other:

 (1) I need to record what data I have stored from a file.

 (2) I need to record where I stored the data.

 (3) I need to make sure that I don't see old data.

 (4) I need to make sure that I don't see data in the wrong file.

 (5) I need to make sure I lose as little as possible on a crash.

 (6) I want to be able to record what changes were made in the event we're
     disconnected from the server.

For my fscache-iter branch, (1) is done with a map in an xattr, but I only
cache up to 1G in a file at the moment; (2), (4) and, to some extent (5), are
handled by the backing fs; (3) is handled by tagging the file and storing
coherency data in an xattr (though tmpfiles are used on full invalidation).
(6) is not yet supported.

For upstream, (1), (2), (4) and to some extent (5) are handled through the
backing fs.  (3) is handled by storing coherency data in an xattr and
truncating the file on invalidation; (6) is not yet supported.

However, there are some performance problems arising in my fscache-iter
branch:

 (1) It's doing a lot of synchronous metadata operations (tmpfile, truncate,
     setxattr).

 (2) It's retaining a lot of open file structs on cache files.  Cachefiles
     opens the file when it's first asked to access it and retains that till
     the cookie is relinquished or the cache withdrawn (the file* doesn't
     contribute to ENFILE/EMFILE but it still eats memory).

     I can mitigate this by closing much sooner, perhaps opening the file for
     each operation - but at the cost of having to spend time doing more opens
     and closes.  What's in upstream gets away without having to do open/close
     for reads because it calls readpage.

     Alternatively, I can have a background file closer - which requires an
     LRU queue.  This could be combined with a file "sealer".

     Deferred writeback on the netfs starting writes to the cache makes this
     more interesting as I have to retain the interest on the cache object
     beyond the netfs file being closed.

 (3) Trimming excess data from the end of the cache file.  The problem with
     using DIO to write to the cache is that the write has to be rounded up to
     a multiple of the backing fs DIO blocksize, but if the file is truncated
     larger, that excess data now becomes part of the file.

     Possibly it's sufficient to just clear the excess page space before
     writing, but that doesn't necessarily stop a writable mmap from
     scribbling on it.

 (4) Committing outstanding cache metadata at cache withdrawal or netfs
     unmount.  I've previously mentioned this: it ends up with a whole slew of
     synchronous metadata changes being committed to the cache in one go
     (truncates, fallocates, fsync, xattrs, unlink+link of tmpfile) - and this
     can take quite a long time.  The cache needs to be more proactive in
     getting stuff committed as it goes along.

 (5) Attaching to an object requires a pathwalk to it (normally only two
     steps) and then reading various xattrs on it - all synchronous, but can
     be punted to a background threadpool.

Amongst the reasons I was considering moving to an index and a single data
file is to replace the path-lookup step and the xattr reads for each object
with a lookup in a single file, and to reduce the number of open files in the
cache at any one time to around four.
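
Purely as an illustration of what such a single-index entry might carry (an
invented layout, not a proposed on-disk format):

#include <stdint.h>

struct cache_index_entry {
        uint8_t  key_hash[16];          /* hash of the (long, variable) file key */
        uint32_t hash_next;             /* next slot on this hash chain */
        uint32_t lru_prev, lru_next;    /* or a bit-pair usage map instead */
        uint32_t first_block;           /* where the data sits in the data file */
        uint32_t flags;                 /* "open"/dirty state, replacing the xattr */
        uint64_t object_size;           /* real size, for trimming the DIO tail */
        uint8_t  coherency[24];         /* netfs blob, e.g. AFS data version */
        uint8_t  content_map[32];       /* 256 bits = 256 x 256KiB granules */
} __attribute__((packed));              /* 100 bytes: ~100MB per million objects */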

David



* Re: fscache: Redesigning the on-disk cache
From: J. Bruce Fields @ 2021-03-08 18:54 UTC (permalink / raw)
  To: David Howells
  Cc: Amir Goldstein, linux-cachefs, Jeff Layton, David Wysochanski,
	Matthew Wilcox (Oracle),
	Christoph Hellwig, Dave Chinner, Alexander Viro, linux-afs,
	Linux NFS Mailing List, CIFS, ceph-devel, v9fs-developer,
	linux-fsdevel, linux-kernel, Miklos Szeredi

On Mon, Mar 08, 2021 at 09:13:55AM +0000, David Howells wrote:
> Amir Goldstein <amir73il@gmail.com> wrote:
> > With ->fiemap() you can at least make the distinction between a non existing
> > and an UNWRITTEN extent.
> 
> I can't use that for XFS, Ext4 or btrfs, I suspect.  Christoph and Dave's
> assertion is that the cache can't rely on the backing filesystem's metadata
> because these can arbitrarily insert or remove blocks of zeros to bridge or
> split extents.

Could you instead make some sort of explicit contract with the
filesystem?  Maybe you'd flag it at mkfs time and query for it before
allowing a filesystem to be used for fscache.  You don't need every
filesystem to support fscache, right, just one acceptable one?

--b.


* Re: fscache: Redesigning the on-disk cache
From: David Howells @ 2021-03-08 19:08 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: dhowells, Amir Goldstein, linux-cachefs, Jeff Layton,
	David Wysochanski, Matthew Wilcox (Oracle),
	Christoph Hellwig, Dave Chinner, Alexander Viro, linux-afs,
	Linux NFS Mailing List, CIFS, ceph-devel, v9fs-developer,
	linux-fsdevel, linux-kernel, Miklos Szeredi

J. Bruce Fields <bfields@fieldses.org> wrote:

> On Mon, Mar 08, 2021 at 09:13:55AM +0000, David Howells wrote:
> > Amir Goldstein <amir73il@gmail.com> wrote:
> > > With ->fiemap() you can at least make the distinction between a non existing
> > > and an UNWRITTEN extent.
> > 
> > I can't use that for XFS, Ext4 or btrfs, I suspect.  Christoph and Dave's
> > assertion is that the cache can't rely on the backing filesystem's metadata
> > because these can arbitrarily insert or remove blocks of zeros to bridge or
> > split extents.
> 
> Could you instead make some sort of explicit contract with the
> filesystem?  Maybe you'd flag it at mkfs time and query for it before
> allowing a filesystem to be used for fscache.  You don't need every
> filesystem to support fscache, right, just one acceptable one?

I've asked about that, but the filesystem maintainers are reluctant to do
that.

Something might be possible in ext4 using direct access to jbd2, though I
don't know exactly what facilities that offers.

David



* Re: fscache: Redesigning the on-disk cache
From: Dave Chinner @ 2021-03-08 21:55 UTC (permalink / raw)
  To: David Howells
  Cc: Amir Goldstein, linux-cachefs, Jeff Layton, David Wysochanski,
	Matthew Wilcox (Oracle),
	J. Bruce Fields, Christoph Hellwig, Dave Chinner, Alexander Viro,
	linux-afs, Linux NFS Mailing List, CIFS, ceph-devel,
	v9fs-developer, linux-fsdevel, linux-kernel, Miklos Szeredi

On Mon, Mar 08, 2021 at 09:13:55AM +0000, David Howells wrote:
> Amir Goldstein <amir73il@gmail.com> wrote:
> 
> > >  (0a) As (0) but using SEEK_DATA/SEEK_HOLE instead of bmap and opening the
> > >       file for every whole operation (which may combine reads and writes).
> > 
> > I read that NFSv4 supports hole punching, so when using ->bmap() or SEEK_DATA
> > to keep track of present data, it's hard to distinguish between an
> > invalid cached range and a valid "cached hole".
> 
> I wasn't exactly intending to permit caching over NFS.  That leads to fun
> making sure that the superblock you're caching isn't the one that has the
> cache in it.
> 
> However, we will need to handle hole-punching being done on a cached netfs,
> even if that's just to completely invalidate the cache for that file.
> 
> > With ->fiemap() you can at least make the distinction between a non existing
> > and an UNWRITTEN extent.
> 
> I can't use that for XFS, Ext4 or btrfs, I suspect.  Christoph and Dave's
> assertion is that the cache can't rely on the backing filesystem's metadata
> because these can arbitrarily insert or remove blocks of zeros to bridge or
> split extents.

Well, that's not the big problem. The issue that makes FIEMAP
unusable for determining if there is user data present in a file is
that on-disk extent maps aren't exactly coherent with in-memory user
data state.

That is, we can have a hole on disk with delalloc user data in
memory.  There's user data in the file, just not on disk. Same goes
for unwritten extents - there can be dirty data in memory over an
unwritten extent, and it won't get converted to written until the
data is written back and the filesystem runs a conversion
transaction.

So, yeah, if you use FIEMAP to determine where data lies in a file
that is being actively modified, you're going to get corrupt data
sooner rather than later.  SEEK_HOLE/DATA are coherent with in-memory
user data, so they don't have this problem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Metadata writtenback notification? -- was Re: fscache: Redesigning the on-disk cache
From: Dave Chinner @ 2021-03-08 22:32 UTC (permalink / raw)
  To: David Howells
  Cc: Amir Goldstein, linux-cachefs, Jeff Layton, David Wysochanski,
	Matthew Wilcox (Oracle),
	J. Bruce Fields, Christoph Hellwig, Dave Chinner, Alexander Viro,
	linux-afs, Linux NFS Mailing List, CIFS, ceph-devel,
	v9fs-developer, linux-fsdevel, linux-kernel, Miklos Szeredi

On Mon, Mar 08, 2021 at 11:28:41AM +0000, David Howells wrote:
> Amir Goldstein <amir73il@gmail.com> wrote:
> 
> > > But after I've written and sync'd the data, I set the xattr to mark the
> > > file not open.  At the moment I'm doing this too lazily, only doing it
> > > when a netfs file gets evicted or when the cache gets withdrawn, but I
> > > really need to add a queue of objects to be sealed as they're closed.  The
> > > balance is working out how often to do the sealing as something like a
> > > shell script can do a lot of consecutive open/write/close ops.
> > 
> > You could add an internal vfs API wait_for_multiple_inodes_to_be_synced().
> > For example, xfs keeps the "LSN" on each inode, so once the transaction
> > with some LSN has been committed, all the relevant inodes, if not dirty, can
> > be declared as synced, without having to call fsync() on any file and without
> > having to force transaction commit or any IO at all.
> > 
> > Since fscache takes care of submitting the IO, and it shouldn't care about any
> > specific time that the data/metadata hits the disk(?), you can make use of the
> > existing periodic writeback and rolling transaction commit and only ever need
> > to wait for that to happen before marking cache files "closed".
> > 
> > There was a discussion about fsyncing a range of files on LSFMM [1].
> > In the last comment on the article dchinner argues why we already have that
> > API (and now also with io_uring(), but AFAIK, we do not have a useful
> > wait_for_sync() API. And it doesn't need to be exposed to userspace at all.
> > 
> > [1] https://lwn.net/Articles/789024/
> 
> This sounds like an interesting idea.  Actually, what I probably want is a
> notification to say that a particular object has been completely sync'd to
> disk, metadata and all.

This isn't hard to do yourself in the kernel. All it takes is a
workqueue to run vfs_fsync() calls asynchronously and for the work
to queue a local notification/wakeup when the fsync completes...

That's all aio_fsync() does - the notification it queues on
completion is the AIO completion event for userspace - so I think
you could do this in about 50 lines of code if you really needed
it...
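
Roughly that shape, as an untested kernel-style sketch (the structure and
helper names are invented):

#include <linux/workqueue.h>
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/slab.h>

struct cache_sync_work {
        struct work_struct work;
        struct file *file;
        void (*done)(struct file *file, int error);     /* notification hook */
};

static void cache_sync_worker(struct work_struct *work)
{
        struct cache_sync_work *w =
                container_of(work, struct cache_sync_work, work);
        int error = vfs_fsync(w->file, 0);      /* data + metadata */

        w->done(w->file, error);                /* e.g. clear the "open" mark */
        fput(w->file);
        kfree(w);
}

/* Queue an asynchronous fsync of @file; @done runs in workqueue context. */
static int cache_sync_object(struct file *file,
                             void (*done)(struct file *, int))
{
        struct cache_sync_work *w = kzalloc(sizeof(*w), GFP_KERNEL);

        if (!w)
                return -ENOMEM;
        w->file = get_file(file);
        w->done = done;
        INIT_WORK(&w->work, cache_sync_worker);
        queue_work(system_unbound_wq, &w->work);
        return 0;
}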

> However, there are some performance problems are arising in my fscache-iter
> branch:
> 
>  (1) It's doing a lot of synchronous metadata operations (tmpfile, truncate,
>      setxattr).

Async pipelines using unbound workqueues are your friend.
> 
>  (2) It's retaining a lot of open file structs on cache files.  Cachefiles
>      opens the file when it's first asked to access it and retains that till
>      the cookie is relinquished or the cache withdrawn (the file* doesn't
>      contribute to ENFILE/EMFILE but it still eats memory).

Sounds similar to the problem that the NFSd open file cache solves.
(fs/nfsd/filecache.c)

>  (3) Trimming excess data from the end of the cache file.  The problem with
>      using DIO to write to the cache is that the write has to be rounded up to
>      a multiple of the backing fs DIO blocksize,

Actually, a multiple of the logical sector size of the backing
device behind the backing filesystem.

>      but if the file is truncated
>      larger, that excess data now becomes part of the file.

Keep the actual file size in your tracking xattr.

>      Possibly it's sufficient to just clear the excess page space before
>      writing, but that doesn't necessarily stop a writable mmap from
>      scribbling on it.

We can't stop mmap from scribbling in it. All filesystems have this
problem, so to prevent data leaks we have to zero the post-eof tail
region on every write of the EOF block, anyway.

>  (4) Committing outstanding cache metadata at cache withdrawal or netfs
>      unmount.  I've previously mentioned this: it ends up with a whole slew of
>      synchronous metadata changes being committed to the cache in one go
>      (truncates, fallocates, fsync, xattrs, unlink+link of tmpfile) - and this
>      can take quite a long time.  The cache needs to be more proactive in
>      getting stuff committed as it goes along.

Workqueues give you an easy mechanism for async dispatch and
concurrency for synchronous operations. This is a largely solved
problem...

>  (5) Attaching to an object requires a pathwalk to it (normally only two
>      steps) and then reading various xattrs on it - all synchronous, but can
>      be punted to a background threadpool.

a.k.a. punting to a workqueue :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Metadata writtenback notification? -- was Re: fscache: Redesigning the on-disk cache
From: Matthew Wilcox @ 2021-03-08 23:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: David Howells, Amir Goldstein, linux-cachefs, Jeff Layton,
	David Wysochanski, J. Bruce Fields, Christoph Hellwig,
	Dave Chinner, Alexander Viro, linux-afs, Linux NFS Mailing List,
	CIFS, ceph-devel, v9fs-developer, linux-fsdevel, linux-kernel,
	Miklos Szeredi

On Tue, Mar 09, 2021 at 09:32:47AM +1100, Dave Chinner wrote:
> On Mon, Mar 08, 2021 at 11:28:41AM +0000, David Howells wrote:
> >      Possibly it's sufficient to just clear the excess page space before
> >      writing, but that doesn't necessarily stop a writable mmap from
> >      scribbling on it.
> 
> We can't stop mmap from scribbling in it. All filesystems have this
> problem, so to prevent data leaks we have to zero the post-eof tail
> region on every write of the EOF block, anyway.

That's certainly one approach.  Another would be to zero it during the I/O
completion handler.  It depends whether you can trust the last writer or
not (eg what do we do with an isofs file that happens to contain garbage
after the last byte in the file?)


* Re: fscache: Redesigning the on-disk cache
From: David Howells @ 2021-03-09  9:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: dhowells, Amir Goldstein, linux-cachefs, Jeff Layton,
	David Wysochanski, Matthew Wilcox (Oracle),
	J. Bruce Fields, Christoph Hellwig, Dave Chinner, Alexander Viro,
	linux-afs, Linux NFS Mailing List, CIFS, ceph-devel,
	v9fs-developer, linux-fsdevel, linux-kernel, Miklos Szeredi

Dave Chinner <david@fromorbit.com> wrote:

> > > With ->fiemap() you can at least make the distinction between a non
> > > existing and an UNWRITTEN extent.
> > 
> > I can't use that for XFS, Ext4 or btrfs, I suspect.  Christoph and Dave's
> > assertion is that the cache can't rely on the backing filesystem's metadata
> > because these can arbitrarily insert or remove blocks of zeros to bridge or
> > split extents.
> 
> Well, that's not the big problem. The issue that makes FIEMAP
> unusable for determining if there is user data present in a file is
> that on-disk extent maps aren't exactly coherent with in-memory user
> data state.
> 
> That is, we can have a hole on disk with delalloc user data in
> memory.  There's user data in the file, just not on disk. Same goes
> for unwritten extents - there can be dirty data in memory over an
> unwritten extent, and it won't get converted to written until the
> data is written back and the filesystem runs a conversion
> transaction.
> 
> So, yeah, if you use FIEMAP to determine where data lies in a file
> that is being actively modified, you're going get corrupt data
> sooner rather than later.  SEEK_HOLE/DATA are coherent with in
> memory user data, so don't have this problem.

I thought you and/or Christoph said it *was* a problem to use the backing
filesystem's metadata to track presence of data in the cache because the
filesystem (or its tools) can arbitrarily insert blocks of zeros to
bridge/break up extents.

If that is the case, then that is a big problem, and SEEK_HOLE/DATA won't
suffice.

If it's not a problem - maybe if I can set a mark on a file to tell the
filesystem and tools not to do that - then that would obviate the need for me
to store my own maps.

David



* Re: Metadata writtenback notification? -- was Re: fscache: Redesigning the on-disk cache
From: David Howells @ 2021-03-09 11:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: dhowells, Amir Goldstein, linux-cachefs, Jeff Layton,
	David Wysochanski, Matthew Wilcox (Oracle),
	J. Bruce Fields, Christoph Hellwig, Dave Chinner, Alexander Viro,
	linux-afs, Linux NFS Mailing List, CIFS, ceph-devel,
	v9fs-developer, linux-fsdevel, linux-kernel, Miklos Szeredi

Dave Chinner <david@fromorbit.com> wrote:

> > > There was a discussion about fsyncing a range of files on LSFMM [1].
> > > In the last comment on the article dchinner argues why we already have that
> > > API (and now also with io_uring(), but AFAIK, we do not have a useful
> > > wait_for_sync() API. And it doesn't need to be exposed to userspace at all.
> > > 
> > > [1] https://lwn.net/Articles/789024/
> > 
> > This sounds like an interesting idea.  Actually, what I probably want is a
> > notification to say that a particular object has been completely sync'd to
> > disk, metadata and all.
> 
> This isn't hard to do yourself in the kernel. All it takes is a
> workqueue to run vfs_fsync() calls asynchronously and for the work
> to queue a local notification/wakeup when the fsync completes...
> 
> That's all aio_fsync() does - the notification it queues on
> completion is the AIO completion event for userspace - so I think
> you could do this in about 50 lines of code if you really needed
> it...

I was thinking more in terms of passively finding out when metadata has been
flushed to disk rather than actively forcing it.  Obviously I can manually
flush from a worker thread, but that ties up a thread per file I want to
flush (unless I want to do a higher-level sync).

Btw, looking at aio_fsync(), is there any reason it copies the current creds
rather than just taking a ref on them?  (Granted, this may not be a question
for you)

> > However, there are some performance problems are arising in my fscache-iter
> > branch:
> > 
> >  (1) It's doing a lot of synchronous metadata operations (tmpfile, truncate,
> >      setxattr).
> 
> Async pipelines using unbound workqueues are your friend.

Maybe.  I could just throw everything into a workqueue and let the workqueue
deal with it.  There still have to be synchronisation points, though - I can't
schedule a cache-write from a server-read to the cache following a 3rd-party
induced invalidation until after the invalidation has happened - and that
holds up userspace from writing to the cache.  But maybe it will work.

Btw, how expensive is it to throw an operation off to a workqueue versus doing
it in thread?  Particularly if it's a synchronous op that the thread is going
to have to wait for (e.g. write_begin()).

> >  (2) It's retaining a lot of open file structs on cache files.  Cachefiles
> >      opens the file when it's first asked to access it and retains that till
> >      the cookie is relinquished or the cache withdrawn (the file* doesn't
> >      contribute to ENFILE/EMFILE but it still eats memory).
> 
> Sounds similar to the problem that the NFSd open file cache solves.
> (fs/nfsd/filecache.c)

Looks similar to what I was thinking of: having a queue of
currently-not-in-use cookies to go through and commit and close.

> >      but if the file is truncated
> >      larger, that excess data now becomes part of the file.
> 
> Keep the actual file size in your tracking xattr.

I do that, but it doesn't help entirely.  If someone truncates the file larger
and then writes non-contiguously, the problem occurs.

I've tried truncating the file down and then truncating it up, but that
requires two synchronous ops - though the latter is relatively cheap.  I've
also tried fallocate() to clear the block.  What I've found is that the next
DIO write then has to sync because these may read data into the pagecache of
the backing file.

Apart from clearing the tail of a page on writing, it might be better for me
to read the data into a spare page, clear the tail and write it back.
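
A small sketch of the fallocate() variant mentioned above (path, sizes and
the choice of FALLOC_FL_ZERO_RANGE are assumptions; the backing fs has to
support it, and as noted it may drag pages into the backing file's
pagecache):

/* DIO forces the final write to be rounded up past the real EOF; zero the
 * overshoot afterwards so a later truncate-up can't expose stale bytes.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DIO_BLOCK 4096          /* backing device's logical sector multiple */

int main(void)
{
        size_t object_size = 10000;     /* the real netfs file size */
        size_t rounded = (object_size + DIO_BLOCK - 1) & ~(size_t)(DIO_BLOCK - 1);
        int fd = open("/var/cache/fscache/data", O_RDWR | O_CREAT | O_DIRECT, 0600);
        void *buf;

        if (fd < 0 || posix_memalign(&buf, DIO_BLOCK, rounded))
                return 1;
        memset(buf, 0, rounded);        /* stand-in for data fetched from the server */

        if (pwrite(fd, buf, rounded, 0) != (ssize_t)rounded)
                return 1;

        /* Zero everything between the real size and the rounded-up write;
         * the real size itself lives in the tracking xattr/index. */
        if (fallocate(fd, FALLOC_FL_ZERO_RANGE, object_size,
                      rounded - object_size) < 0)
                return 1;
        close(fd);
        return 0;
}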

> >      Possibly it's sufficient to just clear the excess page space before
> >      writing, but that doesn't necessarily stop a writable mmap from
> >      scribbling on it.
> 
> We can't stop mmap from scribbling in it. All filesystems have this
> problem, so to prevent data leaks we have to zero the post-eof tail
> region on every write of the EOF block, anyway.

I meant an mmap scribbling on it after it's been cleared - but I guess taking
away the PTE-writeable flag and making page_mkwrite() wait should solve that.

> >  (4) Committing outstanding cache metadata at cache withdrawal or netfs
> >      unmount.  I've previously mentioned this: it ends up with a whole
> >      slew of synchronous metadata changes being committed to the cache in
> >      one go (truncates, fallocates, fsync, xattrs, unlink+link of tmpfile)
> >      - and this can take quite a long time.  The cache needs to be more
> >      proactive in getting stuff committed as it goes along.
> 
> Workqueues give you an easy mechanism for async dispatch and
> concurrency for synchronous operations. This is a largely solved
> problem...

Yes and no.  Yes, I can fan out the number of threads doing the committing,
but there's still a limit on the I/O bandwidth - and a lot of the operations
still have to hit the disk in the right order.  It still stuffs up the user
experience if the cache eats up the entirety of the disk I/O for a few seconds
just because an automount expired.

Probably the progressive committing approach is a better one so that there's
less to do at the end.

> >  (5) Attaching to an object requires a pathwalk to it (normally only two
> >      steps) and then reading various xattrs on it - all synchronous, but can
> >      be punted to a background threadpool.
> 
> a.k.a. punting to a workqueue :)

I do that, but it doesn't help so much.  Whilst it can mitigate the effect by
running in parallel with userspace, userspace tends to move pretty quickly from
open() to read() - at which point we have to wait anyway.

The problem is that all the steps are synchronous and, for the most part, have
to be sequential because there's a dependency chain: 2 x dir-lookup, get LSM
xattrs, get cache xattrs - then read the data if it's present.  I might be
able to speculate at the end and read two cache xattrs in parallel, but each
one requires a separate thread to do it.

On top of that, if the user is running a parallel application such as building
a kernel, a CPU running an offloaded I/O thread isn't running a user thread.
What I've found is that increasing the size of the threadpool doesn't actually
affect the time taken.

What I've done in my fscache-iter branch is to have a small thread pool and
offload work to it if there's a thread free - otherwise process the work in
the calling userspace thread and avoid the context switching.


One reason I was wondering about moving to an approach whereby I have an index
that locates all the blocks (which are then kept in a single file) is that I
can probably keep the entire index in RAM and so the lookup costs are vastly
reduced.  The downside as Amir pointed out is that metadata coherency is much
harder if I don't just want to blow the cache away if the cache isn't properly
committed when the machine is rebooted.

Note that OpenAFS has been using a single-index approach, with each 256K block
of data in its own file.  They then zap any file that's newer than the index
file when the cache is started, assuming that that file might be corrupted.

David


