* Could it be made possible to offer "supplementary" data to a DIO write ?
@ 2021-08-05 10:19 David Howells
  2021-08-05 12:37 ` Matthew Wilcox
  2021-08-05 13:07 ` David Howells
  0 siblings, 2 replies; 22+ messages in thread
From: David Howells @ 2021-08-05 10:19 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: dhowells, jlayton, Matthew Wilcox, Christoph Hellwig,
	Linus Torvalds, dchinner, linux-block, linux-kernel

Hi,

I'm working on network filesystem write helpers to go with the read helpers,
and I see situations where I want to write a few bytes to the cache, but have
more available that could be written also if it would allow the
filesystem/blockdev to optimise its layout.

Say, for example, I need to write a 3-byte change from a page, where that page
is part of a 256K sequence in the pagecache.  Currently, I have to round the
3-bytes out to DIO size/alignment, but I could say to the API, for example,
"here's a 256K iterator - I need bytes 225-227 written, but you can write more
if you want to"?
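
Purely as an illustration of the rounding the current API forces (helper names invented here, assuming a power-of-two DIO alignment):

```c
#include <assert.h>
#include <stddef.h>

/* Illustration only: expand a modified byte range [start, end) out to
 * the DIO size/alignment the backing store demands.  "align" must be
 * a power of two. */
static size_t dio_round_down(size_t pos, size_t align)
{
	return pos & ~(align - 1);
}

static size_t dio_round_up(size_t pos, size_t align)
{
	return (pos + align - 1) & ~(align - 1);
}
```

With a 512-byte alignment, the 3-byte change at 225-227 becomes a write of [0, 512) - the rounding problem in miniature.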

Would it be useful/feasible to have some sort of interface that allows the
offer to be made?

David


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Could it be made possible to offer "supplementary" data to a DIO write ?
  2021-08-05 10:19 Could it be made possible to offer "supplementary" data to a DIO write ? David Howells
@ 2021-08-05 12:37 ` Matthew Wilcox
  2021-08-05 13:07 ` David Howells
  1 sibling, 0 replies; 22+ messages in thread
From: Matthew Wilcox @ 2021-08-05 12:37 UTC (permalink / raw)
  To: David Howells
  Cc: linux-fsdevel, jlayton, Christoph Hellwig, Linus Torvalds,
	dchinner, linux-block, linux-kernel

On Thu, Aug 05, 2021 at 11:19:17AM +0100, David Howells wrote:
> I'm working on network filesystem write helpers to go with the read helpers,
> and I see situations where I want to write a few bytes to the cache, but have
> more available that could be written also if it would allow the
> filesystem/blockdev to optimise its layout.
> 
> Say, for example, I need to write a 3-byte change from a page, where that page
> is part of a 256K sequence in the pagecache.  Currently, I have to round the
> 3-bytes out to DIO size/alignment, but I could say to the API, for example,
> "here's a 256K iterator - I need bytes 225-227 written, but you can write more
> if you want to"?

I think you're optimising the wrong thing.  No actual storage lets you
write three bytes.  You're just pushing the read/modify/write cycle to
the remote end.  So you shouldn't even be tracking that three bytes have
been dirtied; you should be working in multiples of i_blocksize().

I don't know of any storage which lets you ask "can I optimise this
further for you by using a larger size".  Maybe we have some (software)
compressed storage which could do a better job if given a whole 256kB
block to recompress.

So it feels like you're both tracking dirty data at too fine a
granularity, and getting ahead of actual hardware capabilities by trying
to introduce a too-flexible API.


* Re: Could it be made possible to offer "supplementary" data to a DIO write ?
  2021-08-05 10:19 Could it be made possible to offer "supplementary" data to a DIO write ? David Howells
  2021-08-05 12:37 ` Matthew Wilcox
@ 2021-08-05 13:07 ` David Howells
  2021-08-05 13:35   ` Matthew Wilcox
  2021-08-05 14:38   ` David Howells
  1 sibling, 2 replies; 22+ messages in thread
From: David Howells @ 2021-08-05 13:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, linux-fsdevel, jlayton, Christoph Hellwig,
	Linus Torvalds, dchinner, linux-block, linux-kernel

Matthew Wilcox <willy@infradead.org> wrote:

> > Say, for example, I need to write a 3-byte change from a page, where that
> > page is part of a 256K sequence in the pagecache.  Currently, I have to
> > round the 3-bytes out to DIO size/alignment, but I could say to the API,
> > for example, "here's a 256K iterator - I need bytes 225-227 written, but
> > you can write more if you want to"?
> 
> I think you're optimising the wrong thing.  No actual storage lets you
> write three bytes.  You're just pushing the read/modify/write cycle to
> the remote end.  So you shouldn't even be tracking that three bytes have
> been dirtied; you should be working in multiples of i_blocksize().

I'm dealing with network filesystems that don't necessarily let you know what
i_blocksize is.  Assume it to be 1.

Further, only sending, say, 3 bytes and pushing RMW to the remote end is not
necessarily wrong for a network filesystem for at least two reasons: it
reduces the network loading and it reduces the effects of third-party write
collisions.

> I don't know of any storage which lets you ask "can I optimise this
> further for you by using a larger size".  Maybe we have some (software)
> compressed storage which could do a better job if given a whole 256kB
> block to recompress.

It would offer an extent-based filesystem the possibility of adjusting its
extent list.  And if you were mad enough to put your cache on a shingled
drive...  (though you'd probably need a much bigger block than 256K to make
that useful).  Also, jffs2 (if someone used that as a cache) can compress its
blocks.

> So it feels like you're both tracking dirty data at too fine a granularity,
> and getting ahead of actual hardware capabilities by trying to introduce a
> too-flexible API.

We might not know what the h/w caps are and there may be multiple destination
servers with different h/w caps involved.  Note that NFS and AFS in the kernel
both currently track at byte granularity and only send the bytes that changed.
The expense of setting up the write op on the server might actually outweigh
the RMW cycle.  With something like ceph, the server might actually have a
whole-object RMW/COW, say 4M.

Yet further, if your network fs has byte-range locks/leases and you have a
write lock/lease that ends part way into a page, when you drop that lock/lease
you shouldn't flush any data outside of that range lest you overwrite a range
that someone else has a lock/lease on.

David



* Re: Could it be made possible to offer "supplementary" data to a DIO write ?
  2021-08-05 13:07 ` David Howells
@ 2021-08-05 13:35   ` Matthew Wilcox
  2021-08-05 14:38   ` David Howells
  1 sibling, 0 replies; 22+ messages in thread
From: Matthew Wilcox @ 2021-08-05 13:35 UTC (permalink / raw)
  To: David Howells
  Cc: linux-fsdevel, jlayton, Christoph Hellwig, Linus Torvalds,
	dchinner, linux-block, linux-kernel

On Thu, Aug 05, 2021 at 02:07:03PM +0100, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
> > > Say, for example, I need to write a 3-byte change from a page, where that
> > > page is part of a 256K sequence in the pagecache.  Currently, I have to
> > > round the 3-bytes out to DIO size/alignment, but I could say to the API,
> > > for example, "here's a 256K iterator - I need bytes 225-227 written, but
> > > you can write more if you want to"?
> > 
> > I think you're optimising the wrong thing.  No actual storage lets you
> > write three bytes.  You're just pushing the read/modify/write cycle to
> > the remote end.  So you shouldn't even be tracking that three bytes have
> > been dirtied; you should be working in multiples of i_blocksize().
> 
> I'm dealing with network filesystems that don't necessarily let you know what
> i_blocksize is.  Assume it to be 1.

That's a really bad idea.  The overhead of tracking at byte level
granularity is just not worth it.

> Further, only sending, say, 3 bytes and pushing RMW to the remote end is not
> necessarily wrong for a network filesystem for at least two reasons: it
> reduces the network loading and it reduces the effects of third-party write
> collisions.

You can already get 400Gbit ethernet.  Saving 500 bytes by sending
just the 12 bytes that changed is optimising the wrong thing.  If you
have two clients accessing the same file at byte granularity, you've
already lost.

> > I don't know of any storage which lets you ask "can I optimise this
> > further for you by using a larger size".  Maybe we have some (software)
> > compressed storage which could do a better job if given a whole 256kB
> > block to recompress.
> 
> It would offer an extent-based filesystem the possibility of adjusting its
> extent list.  And if you were mad enough to put your cache on a shingled
> drive...  (though you'd probably need a much bigger block than 256K to make
> that useful).  Also, jffs2 (if someone used that as a cache) can compress its
> blocks.

Extent based filesystems create huge extents anyway:

$ /usr/sbin/xfs_bmap *.deb
linux-headers-5.14.0-rc1+_5.14.0-rc1+-1_amd64.deb:
	0: [0..16095]: 150008440..150024535
linux-image-5.14.0-rc1+_5.14.0-rc1+-1_amd64.deb:
	0: [0..383]: 149991824..149992207
	1: [384..103495]: 166567016..166670127
linux-image-5.14.0-rc1+-dbg_5.14.0-rc1+-1_amd64.deb:
	0: [0..183]: 149993016..149993199
	1: [184..1503623]: 763050936..764554375
linux-libc-dev_5.14.0-rc1+-1_amd64.deb:
	0: [0..2311]: 149979624..149981935

This has already happened when you initially wrote to the file backing
the cache.  Updates are just going to write to the already-allocated
blocks, unless you've done something utterly inappropriate to the
situation like reflinked the files.

> > So it feels like you're both tracking dirty data at too fine a granularity,
> > and getting ahead of actual hardware capabilities by trying to introduce a
> > too-flexible API.
> 
> We might not know what the h/w caps are and there may be multiple destination
> servers with different h/w caps involved.  Note that NFS and AFS in the kernel
> both currently track at byte granularity and only send the bytes that changed.
> The expense of setting up the write op on the server might actually outweigh
> the RMW cycle.  With something like ceph, the server might actually have a
> whole-object RMW/COW, say 4M.
> 
> Yet further, if your network fs has byte-range locks/leases and you have a
> write lock/lease that ends part way into a page, when you drop that lock/lease
> you shouldn't flush any data outside of that range lest you overwrite a range
> that someone else has a lock/lease on.

If you want to take leases at byte granularity, and then not writeback
parts of a page that are outside that lease, feel free.  It shouldn't
affect how you track dirtiness or how you writethrough the page cache
to the disk cache.


* Re: Could it be made possible to offer "supplementary" data to a DIO write ?
  2021-08-05 13:07 ` David Howells
  2021-08-05 13:35   ` Matthew Wilcox
@ 2021-08-05 14:38   ` David Howells
  2021-08-05 15:06     ` Matthew Wilcox
                       ` (3 more replies)
  1 sibling, 4 replies; 22+ messages in thread
From: David Howells @ 2021-08-05 14:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, linux-fsdevel, jlayton, Christoph Hellwig,
	Linus Torvalds, dchinner, linux-block, linux-kernel

Matthew Wilcox <willy@infradead.org> wrote:

> You can already get 400Gbit ethernet.

Sorry, but that's not likely to become relevant any time soon.  Besides, my
laptop's wifi doesn't really do that yet.

> Saving 500 bytes by sending just the 12 bytes that changed is optimising the
> wrong thing.

In one sense, at least, you're correct.  The cost of setting up an RPC to do
the write and setting up crypto is high compared to transmitting 3 bytes vs 4k
bytes.

> If you have two clients accessing the same file at byte granularity, you've
> already lost.

Doesn't stop people doing it, though.  People have sqlite, dbm, mail stores and
whatnot in their homedirs from the desktop environments.  Granted, most of the
time people don't log in twice with the same homedir from two different
machines (and it doesn't - or at least didn't - work with Gnome or KDE).

> Extent based filesystems create huge extents anyway:

Okay, so it's not feasible.  That's fine.

> This has already happened when you initially wrote to the file backing
> the cache.  Updates are just going to write to the already-allocated
> blocks, unless you've done something utterly inappropriate to the
> situation like reflinked the files.

Or the file is being read random-access and we now have a block we didn't have
before that is contiguous to another block we already have.

> If you want to take leases at byte granularity, and then not writeback
> parts of a page that are outside that lease, feel free.  It shouldn't
> affect how you track dirtiness or how you writethrough the page cache
> to the disk cache.

Indeed.  Handling writes to the local disk cache is different from handling
writes to the server(s).  The cache has a larger block size but I don't have
to worry about third-party conflicts on it, whereas the server can be taken as
having no minimum block size, but my write can clash with someone else's.

Generally, I prefer to write back the minimum I can get away with (as does the
Linux NFS client AFAICT).

However, if everyone agrees that we should only ever write back a multiple of
a certain block size, even to network filesystems, what block size should that
be?  Note that PAGE_SIZE varies across arches and folios are going to
exacerbate this.  What I don't want to happen is that you read from a file, it
creates, say, a 4M (or larger) folio; you change three bytes and then you're
forced to write back the entire 4M folio.

Note that when content crypto or compression is employed, some multiple of the
size of the encrypted/compressed blocks would be a requirement.

David



* Re: Could it be made possible to offer "supplementary" data to a DIO write ?
  2021-08-05 14:38   ` David Howells
@ 2021-08-05 15:06     ` Matthew Wilcox
  2021-08-05 15:38     ` David Howells
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Matthew Wilcox @ 2021-08-05 15:06 UTC (permalink / raw)
  To: David Howells
  Cc: linux-fsdevel, jlayton, Christoph Hellwig, Linus Torvalds,
	dchinner, linux-block, linux-kernel

On Thu, Aug 05, 2021 at 03:38:01PM +0100, David Howells wrote:
> > If you want to take leases at byte granularity, and then not writeback
> > parts of a page that are outside that lease, feel free.  It shouldn't
> > affect how you track dirtiness or how you writethrough the page cache
> > to the disk cache.
> 
> Indeed.  Handling writes to the local disk cache is different from handling
> writes to the server(s).  The cache has a larger block size but I don't have
> to worry about third-party conflicts on it, whereas the server can be taken as
> having no minimum block size, but my write can clash with someone else's.
> 
> Generally, I prefer to write back the minimum I can get away with (as does the
> Linux NFS client AFAICT).
> 
> However, if everyone agrees that we should only ever write back a multiple of
> a certain block size, even to network filesystems, what block size should that
> be?

If your network protocol doesn't give you a way to ask the server what
size it is, assume 512 bytes and allow it to be overridden by a mount
option.

> Note that PAGE_SIZE varies across arches and folios are going to
> exacerbate this.  What I don't want to happen is that you read from a file, it
> creates, say, a 4M (or larger) folio; you change three bytes and then you're
> forced to write back the entire 4M folio.

Actually, you do.  Two situations:

1. Application uses madvise(MADV_HUGEPAGE).  In response, we create a 2MB
page and mmap it aligned.  We use a PMD sized TLB entry and then the
CPU dirties a few bytes with a store.  There's no sub-TLB-entry tracking
of dirtiness.  It's just the whole 2MB.

2. The bigger the folio, the more writes it will absorb before being
written back.  So when you're writing back that 4MB folio, you're not
just servicing this 3 byte write, you're servicing every other write
which hit this 4MB chunk of the file.

There is one exception I've found, and that's O_SYNC writes.  These are
pretty rare, and I think I have a solution to it which essentially treats
the page cache as writethrough (for sync writes).  We skip marking
the page (folio) as dirty and go straight to marking it as writeback.
We have all the information we need about which bytes to write and we're
actually using the existing page cache infrastructure to do it.

I'm working on implementing that in iomap; there's some SMOP type
problems to solve, but it looks doable.


* Re: Could it be made possible to offer "supplementary" data to a DIO write ?
  2021-08-05 14:38   ` David Howells
  2021-08-05 15:06     ` Matthew Wilcox
@ 2021-08-05 15:38     ` David Howells
  2021-08-05 16:35     ` Canvassing for network filesystem write size vs page size David Howells
  2021-08-05 17:45     ` Could it be made possible to offer "supplementary" data to a DIO write ? Adam Borowski
  3 siblings, 0 replies; 22+ messages in thread
From: David Howells @ 2021-08-05 15:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, linux-fsdevel, jlayton, Christoph Hellwig,
	Linus Torvalds, dchinner, linux-block, linux-kernel

Matthew Wilcox <willy@infradead.org> wrote:

> > Note that PAGE_SIZE varies across arches and folios are going to
> > exacerbate this.  What I don't want to happen is that you read from a
> > file, it creates, say, a 4M (or larger) folio; you change three bytes and
> > then you're forced to write back the entire 4M folio.
> 
> Actually, you do.  Two situations:
> 
> 1. Application uses madvise(MADV_HUGEPAGE).  In response, we create a 2MB
> page and mmap it aligned.  We use a PMD sized TLB entry and then the
> CPU dirties a few bytes with a store.  There's no sub-TLB-entry tracking
> of dirtiness.  It's just the whole 2MB.

That's a special case.  The app specifically asked for it.  I'll grant with
mmap you have to mark a whole page as being dirty - but if you mmapped it, you
need to understand that's what will happen.

> 2. The bigger the folio, the more writes it will absorb before being
> written back.  So when you're writing back that 4MB folio, you're not
> just servicing this 3 byte write, you're servicing every other write
> which hit this 4MB chunk of the file.

You can argue it that way - but we already do it bytewise in some filesystems,
so what you want would necessitate a change of behaviour.

Note also that if the page size > max RPC payload size (1MB in NFS, I think),
you have to make multiple write operations to fulfil that writeback; further,
if you have an object-based system you might be making writes to multiple
servers, some of which will not actually make a change, to make that
writeback.
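
To illustrate with an invented helper: the number of write RPCs one folio writeback turns into when the payload is capped.

```c
#include <assert.h>
#include <stddef.h>

/* Illustration only: how many write RPCs are needed to flush "len"
 * dirty bytes when each RPC payload is capped at "wsize" bytes. */
static size_t rpcs_needed(size_t len, size_t wsize)
{
	return (len + wsize - 1) / wsize;	/* divide, rounding up */
}
```

A 4M folio against a 1M payload cap needs four write operations even if only three bytes actually changed.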

I wonder if this needs pushing onto the various network filesystem mailing
lists to find out what they want and why.

David



* Canvassing for network filesystem write size vs page size
  2021-08-05 14:38   ` David Howells
  2021-08-05 15:06     ` Matthew Wilcox
  2021-08-05 15:38     ` David Howells
@ 2021-08-05 16:35     ` David Howells
  2021-08-05 17:27         ` Linus Torvalds
                         ` (4 more replies)
  2021-08-05 17:45     ` Could it be made possible to offer "supplementary" data to a DIO write ? Adam Borowski
  3 siblings, 5 replies; 22+ messages in thread
From: David Howells @ 2021-08-05 16:35 UTC (permalink / raw)
  To: Anna Schumaker, Trond Myklebust, Jeff Layton, Steve French,
	Dominique Martinet, Mike Marshall, Miklos Szeredi
  Cc: dhowells, Matthew Wilcox (Oracle),
	Shyam Prasad N, Linus Torvalds, linux-cachefs, linux-afs,
	linux-nfs, linux-cifs, ceph-devel, v9fs-developer, devel,
	linux-mm, linux-fsdevel, linux-kernel

With Willy's upcoming folio changes, from a filesystem point of view, we're
going to be looking at folios instead of pages, where:

 - a folio is a contiguous collection of pages;

 - each page in the folio might be a standard PAGE_SIZE page (4K or 64K, say)
   or a huge page (say 2M each);

 - a folio has one dirty flag and one writeback flag that applies to all
   constituent pages;

 - a complete folio currently is limited to PMD_SIZE or order 8, but could
   theoretically go up to about 2GiB before various integer fields have to be
   modified (not to mention the memory allocator).
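
As a rough sketch of what the above implies (structure and names invented here): with one dirty flag per folio, a filesystem wanting to write back less than the whole folio has to track at least one dirty byte extent itself, and a second, disjoint dirtying can only grow that extent.

```c
#include <assert.h>
#include <stddef.h>

/* Invented illustration: a single dirty byte extent within a folio. */
struct dirty_extent {
	size_t start;	/* offset of first dirty byte in the folio */
	size_t end;	/* one past the last dirty byte; start == end means clean */
};

static void extent_mark_dirty(struct dirty_extent *d, size_t start, size_t end)
{
	if (d->start == d->end) {	/* folio was clean */
		d->start = start;
		d->end = end;
		return;
	}
	if (start < d->start)	/* otherwise, grow the one extent */
		d->start = start;
	if (end > d->end)
		d->end = end;
}
```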

Willy is arguing that network filesystems should, except in certain very
special situations (eg. O_SYNC), only write whole folios (limited to EOF).

Some network filesystems, however, currently keep track of which byte ranges
are modified within a dirty page (AFS does; NFS seems to also) and only write
out the modified data.

Also, there are limits to the maximum RPC payload sizes, so writing back large
pages may necessitate multiple writes, possibly to multiple servers.

What I'm trying to do is collate each network filesystem's properties (I'm
including FUSE in that).

So we have the following filesystems:

 Plan9
 - Doesn't track bytes
 - Only writes single pages

 AFS
 - Max RPC payload theoretically ~5.5 TiB (OpenAFS), ~16EiB (Auristor/kAFS)
 - kAFS (Linux kernel)
   - Tracks bytes, only writes back what changed
   - Writes from up to 65535 contiguous pages.
 - OpenAFS/Auristor (UNIX/Linux)
   - Deal with cache-sized blocks (configurable, but something from 8K to 2M),
     reads and writes in these blocks
 - OpenAFS/Auristor (Windows)
   - Track bytes, write back only what changed

 Ceph
 - File divided into objects (typically 2MiB in size), which may be scattered
   over multiple servers.
 - Max RPC size is therefore object size.
 - Doesn't track bytes.

 CIFS/SMB
 - Writes back just changed bytes immediately under some circumstances
 - Doesn't track bytes and writes back whole pages otherwise.
 - SMB3 has a max RPC size of 16MiB, with a default of 4MiB

 FUSE
 - Doesn't track bytes.
 - Max 'RPC' size of 256 pages (I think).

 NFS
 - Tracks modified bytes within a page.
 - Max RPC size of 1MiB.
 - Files may be constructed of objects scattered over different servers.

 OrangeFS
 - Doesn't track bytes.
 - Multipage writes possible.

If you could help me fill in the gaps, that would be great.

Thanks,
David



* Re: Canvassing for network filesystem write size vs page size
  2021-08-05 16:35     ` Canvassing for network filesystem write size vs page size David Howells
@ 2021-08-05 17:27         ` Linus Torvalds
  2021-08-05 17:52       ` Adam Borowski
                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2021-08-05 17:27 UTC (permalink / raw)
  To: David Howells
  Cc: Anna Schumaker, Trond Myklebust, Jeff Layton, Steve French,
	Dominique Martinet, Mike Marshall, Miklos Szeredi,
	Matthew Wilcox (Oracle),
	Shyam Prasad N, linux-cachefs, linux-afs, open list:NFS, SUNRPC,
	AND...,
	CIFS, ceph-devel, v9fs-developer, devel, Linux-MM, linux-fsdevel,
	Linux Kernel Mailing List

On Thu, Aug 5, 2021 at 9:36 AM David Howells <dhowells@redhat.com> wrote:
>
> Some network filesystems, however, currently keep track of which byte ranges
> are modified within a dirty page (AFS does; NFS seems to also) and only write
> out the modified data.

NFS definitely does. I haven't used NFS in two decades, but I worked
on some of the code (read: I made nfs use the page cache both for
reading and writing) back in my Transmeta days, because NFSv2 was the
default filesystem setup back then.

See fs/nfs/write.c, although I have to admit that I don't recognize
that code any more.

It's fairly important to be able to do streaming writes without having
to read the old contents for some loads. And read-modify-write cycles
are death for performance, so you really want to coalesce writes until
you have the whole page.

That said, I suspect it's also *very* filesystem-specific, to the
point where it might not be worth trying to do in some generic manner.

In particular, NFS had things like interesting credential issues, so
if you have multiple concurrent writers that used different 'struct
file *' to write to the file, you can't just mix the writes. You have
to sync the writes from one writer before you start the writes for the
next one, because one might succeed and the other not.

So you can't just treat it as some random "page cache with dirty byte
extents". You really have to be careful about credentials, timeouts,
etc, and the pending writes have to keep a fair amount of state
around.

At least that was the case two decades ago.

[ goes off and looks. See "nfs_write_begin()" and friends in
fs/nfs/file.c for some of the examples of these things, although it
looks like the code is less aggressive about avoiding the
read-modify-write case than I thought I remembered, and only does it
for write-only opens ]

               Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread


* Re: Canvassing for network filesystem write size vs page size
  2021-08-05 17:27         ` Linus Torvalds
@ 2021-08-05 17:43           ` Trond Myklebust
  -1 siblings, 0 replies; 22+ messages in thread
From: Trond Myklebust @ 2021-08-05 17:43 UTC (permalink / raw)
  To: torvalds, dhowells
  Cc: jlayton, willy, nspmangalore, linux-cifs, linux-kernel, asmadeus,
	anna.schumaker, linux-mm, miklos, linux-nfs, linux-afs, devel,
	hubcap, linux-cachefs, linux-fsdevel, sfrench, ceph-devel,
	v9fs-developer

On Thu, 2021-08-05 at 10:27 -0700, Linus Torvalds wrote:
> On Thu, Aug 5, 2021 at 9:36 AM David Howells <dhowells@redhat.com>
> wrote:
> > 
> > Some network filesystems, however, currently keep track of which
> > byte ranges
> > are modified within a dirty page (AFS does; NFS seems to also) and
> > only write
> > out the modified data.
> 
> NFS definitely does. I haven't used NFS in two decades, but I worked
> on some of the code (read: I made nfs use the page cache both for
> reading and writing) back in my Transmeta days, because NFSv2 was the
> default filesystem setup back then.
> 
> See fs/nfs/write.c, although I have to admit that I don't recognize
> that code any more.
> 
> It's fairly important to be able to do streaming writes without
> having
> to read the old contents for some loads. And read-modify-write cycles
> are death for performance, so you really want to coalesce writes
> until
> you have the whole page.
> 
> That said, I suspect it's also *very* filesystem-specific, to the
> point where it might not be worth trying to do in some generic
> manner.
> 
> In particular, NFS had things like interesting credential issues, so
> if you have multiple concurrent writers that used different 'struct
> file *' to write to the file, you can't just mix the writes. You have
> to sync the writes from one writer before you start the writes for
> the
> next one, because one might succeed and the other not.
> 
> So you can't just treat it as some random "page cache with dirty byte
> extents". You really have to be careful about credentials, timeouts,
> etc, and the pending writes have to keep a fair amount of state
> around.
> 
> At least that was the case two decades ago.
> 
> [ goes off and looks. See "nfs_write_begin()" and friends in
> fs/nfs/file.c for some of the examples of these things, although it
> looks like the code is less aggressive about avoiding the
> read-modify-write case than I thought I remembered, and only does it
> for write-only opens ]
> 

All correct, however there is also the issue that even if we have done
a read-modify-write, we can't always extend the write to cover the
entire page.

If you look at nfs_can_extend_write(), you'll note that we don't extend
the page data if the file is range locked, if the attributes have not
been revalidated, or if the page cache contents are suspected to be
invalid for some other reason.
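
(A loose paraphrase of those checks - not the actual fs/nfs/write.c code; the struct and field names below are invented.)

```c
#include <assert.h>
#include <stdbool.h>

/* Invented paraphrase of the conditions described above: whether a
 * partial-page write may safely be extended to cover the whole page. */
struct nfs_page_state {
	bool range_locked;	/* file is byte-range locked */
	bool attrs_revalidated;	/* inode attributes known to be fresh */
	bool cache_trusted;	/* page cache contents not suspect */
};

static bool can_extend_write(const struct nfs_page_state *s)
{
	return !s->range_locked && s->attrs_revalidated && s->cache_trusted;
}
```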

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: Canvassing for network filesystem write size vs page size
@ 2021-08-05 17:43           ` Trond Myklebust
  0 siblings, 0 replies; 22+ messages in thread
From: Trond Myklebust @ 2021-08-05 17:43 UTC (permalink / raw)
  To: torvalds, dhowells
  Cc: jlayton, willy, nspmangalore, linux-cifs, linux-kernel, asmadeus,
	anna.schumaker, linux-mm, miklos, linux-nfs, linux-afs, devel,
	hubcap, linux-cachefs, linux-fsdevel, sfrench, ceph-devel,
	v9fs-developer

On Thu, 2021-08-05 at 10:27 -0700, Linus Torvalds wrote:
> On Thu, Aug 5, 2021 at 9:36 AM David Howells <dhowells@redhat.com>
> wrote:
> > 
> > Some network filesystems, however, currently keep track of which
> > byte ranges
> > are modified within a dirty page (AFS does; NFS seems to also) and
> > only write
> > out the modified data.
> 
> NFS definitely does. I haven't used NFS in two decades, but I worked
> on some of the code (read: I made nfs use the page cache both for
> reading and writing) back in my Transmeta days, because NFSv2 was the
> default filesystem setup back then.
> 
> See fs/nfs/write.c, although I have to admit that I don't recognize
> that code any more.
> 
> It's fairly important to be able to do streaming writes without
> having
> to read the old contents for some loads. And read-modify-write cycles
> are death for performance, so you really want to coalesce writes
> until
> you have the whole page.
> 
> That said, I suspect it's also *very* filesystem-specific, to the
> point where it might not be worth trying to do in some generic
> manner.
> 
> In particular, NFS had things like interesting credential issues, so
> if you have multiple concurrent writers that used different 'struct
> file *' to write to the file, you can't just mix the writes. You have
> to sync the writes from one writer before you start the writes for
> the
> next one, because one might succeed and the other not.
> 
> So you can't just treat it as some random "page cache with dirty byte
> extents". You really have to be careful about credentials, timeouts,
> etc, and the pending writes have to keep a fair amount of state
> around.
> 
> At least that was the case two decades ago.
> 
> [ goes off and looks. See "nfs_write_begin()" and friends in
> fs/nfs/file.c for some of the examples of these things, although it
> looks like the code is less aggressive about avoiding the
> read-modify-write case than I thought I remembered, and only does it
> for write-only opens ]
> 

All correct, however there is also the issue that even if we have done
a read-modify-write, we can't always extend the write to cover the
entire page.

If you look at nfs_can_extend_write(), you'll note that we don't extend
the page data if the file is range locked, if the attributes have not
been revalidated, or if the page cache contents are suspected to be
invalid for some other reason.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Could it be made possible to offer "supplementary" data to a DIO write ?
  2021-08-05 14:38   ` David Howells
                       ` (2 preceding siblings ...)
  2021-08-05 16:35     ` Canvassing for network filesystem write size vs page size David Howells
@ 2021-08-05 17:45     ` Adam Borowski
  3 siblings, 0 replies; 22+ messages in thread
From: Adam Borowski @ 2021-08-05 17:45 UTC (permalink / raw)
  To: David Howells
  Cc: Matthew Wilcox, linux-fsdevel, jlayton, Christoph Hellwig,
	Linus Torvalds, dchinner, linux-block, linux-kernel

On Thu, Aug 05, 2021 at 03:38:01PM +0100, David Howells wrote:
> Generally, I prefer to write back the minimum I can get away with (as does the
> Linux NFS client AFAICT).
> 
> However, if everyone agrees that we should only ever write back a multiple of
> a certain block size, even to network filesystems, what block size should that
> be?  Note that PAGE_SIZE varies across arches and folios are going to
> exacerbate this.  What I don't want to happen is that you read from a file, it
> creates, say, a 4M (or larger) folio; you change three bytes and then you're
> forced to write back the entire 4M folio.

grep . /sys/class/block/*/queue/minimum_io_size
and also hw_sector_size, logical_block_size, physical_block_size.

The data seems suspect to me, though.  I get 4096 for a spinner (looks
sane), 512 for nvme (less than page size), and 4096 for pmem (I'd expect
cacheline or ECC block).


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Certified airhead; got the CT scan to prove that!
⠈⠳⣄⠀⠀⠀⠀

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Canvassing for network filesystem write size vs page size
  2021-08-05 16:35     ` Canvassing for network filesystem write size vs page size David Howells
  2021-08-05 17:27         ` Linus Torvalds
@ 2021-08-05 17:52       ` Adam Borowski
  2021-08-05 18:50         ` Jeff Layton
                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: Adam Borowski @ 2021-08-05 17:52 UTC (permalink / raw)
  To: David Howells
  Cc: Anna Schumaker, Trond Myklebust, Jeff Layton, Steve French,
	Dominique Martinet, Mike Marshall, Miklos Szeredi,
	Matthew Wilcox (Oracle),
	Shyam Prasad N, Linus Torvalds, linux-cachefs, linux-afs,
	linux-nfs, linux-cifs, ceph-devel, v9fs-developer, devel,
	linux-mm, linux-fsdevel, linux-kernel

On Thu, Aug 05, 2021 at 05:35:33PM +0100, David Howells wrote:
>  - a complete folio currently is limited to PMD_SIZE or order 8, but could
>    theoretically go up to about 2GiB before various integer fields have to be
>    modified (not to mention the memory allocator).

No support for riscv 512GB pages? :p


-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ The ill-thought conversion to time64_t will make us suffer from
⢿⡄⠘⠷⠚⠋⠀ the Y292B problem.  So let's move the Epoch by 435451400064000000
⠈⠳⣄⠀⠀⠀⠀ and make it unsigned -- that'll almost double the range.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Canvassing for network filesystem write size vs page size
  2021-08-05 16:35     ` Canvassing for network filesystem write size vs page size David Howells
@ 2021-08-05 18:50         ` Jeff Layton
  2021-08-05 17:52       ` Adam Borowski
                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: Jeff Layton @ 2021-08-05 18:50 UTC (permalink / raw)
  To: David Howells, Anna Schumaker, Trond Myklebust, Steve French,
	Dominique Martinet, Mike Marshall, Miklos Szeredi
  Cc: Matthew Wilcox (Oracle),
	Shyam Prasad N, Linus Torvalds, linux-cachefs, linux-afs,
	linux-nfs, linux-cifs, ceph-devel, v9fs-developer, devel,
	linux-mm, linux-fsdevel, linux-kernel

On Thu, 2021-08-05 at 17:35 +0100, David Howells wrote:
> With Willy's upcoming folio changes, from a filesystem point of view, we're
> going to be looking at folios instead of pages, where:
> 
>  - a folio is a contiguous collection of pages;
> 
>  - each page in the folio might be a standard PAGE_SIZE page (4K or 64K, say) or
>    a huge page (say 2M each);
> 
>  - a folio has one dirty flag and one writeback flag that applies to all
>    constituent pages;
> 
>  - a complete folio currently is limited to PMD_SIZE or order 8, but could
>    theoretically go up to about 2GiB before various integer fields have to be
>    modified (not to mention the memory allocator).
> 
> Willy is arguing that network filesystems should, except in certain very
> special situations (eg. O_SYNC), only write whole folios (limited to EOF).
> 
> Some network filesystems, however, currently keep track of which byte ranges
> are modified within a dirty page (AFS does; NFS seems to also) and only write
> out the modified data.
> 
> Also, there are limits to the maximum RPC payload sizes, so writing back large
> pages may necessitate multiple writes, possibly to multiple servers.
> 
> What I'm trying to do is collate each network filesystem's properties (I'm
> including FUSE in that).
> 
> So we have the following filesystems:
> 
>  Plan9
>  - Doesn't track bytes
>  - Only writes single pages
> 
>  AFS
>  - Max RPC payload theoretically ~5.5 TiB (OpenAFS), ~16EiB (Auristor/kAFS)
>  - kAFS (Linux kernel)
>    - Tracks bytes, only writes back what changed
>    - Writes from up to 65535 contiguous pages.
>  - OpenAFS/Auristor (UNIX/Linux)
>    - Deal with cache-sized blocks (configurable, but something from 8K to 2M),
>      reads and writes in these blocks
>  - OpenAFS/Auristor (Windows)
>    - Track bytes, write back only what changed
> 
>  Ceph
>  - File divided into objects (typically 2MiB in size), which may be scattered
>    over multiple servers.

The default is 4M in modern cephfs clusters, but the rest is correct.

>  - Max RPC size is therefore object size.
>  - Doesn't track bytes.
> 
>  CIFS/SMB
>  - Writes back just changed bytes immediately under some circumstances

cifs.ko can also just do writes to specific byte ranges synchronously
when it doesn't have the ability to use the cache (i.e. no oplock or
lease). CephFS also does this when it doesn't have the necessary
capabilities (aka caps) to use the pagecache.

If we want to add infrastructure for netfs writeback, then it would be
nice to consider similar infrastructure to handle those cases as well.

>  - Doesn't track bytes and writes back whole pages otherwise.
>  - SMB3 has a max RPC size of 16MiB, with a default of 4MiB
> 
>  FUSE
>  - Doesn't track bytes.
>  - Max 'RPC' size of 256 pages (I think).
> 
>  NFS
>  - Tracks modified bytes within a page.
>  - Max RPC size of 1MiB.
>  - Files may be constructed of objects scattered over different servers.
> 
>  OrangeFS
>  - Doesn't track bytes.
>  - Multipage writes possible.
> 
> If you could help me fill in the gaps, that would be great.


-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 22+ messages in thread


* Re: Canvassing for network filesystem write size vs page size
  2021-08-05 17:27         ` Linus Torvalds
  (?)
  (?)
@ 2021-08-05 22:11         ` Matthew Wilcox
  -1 siblings, 0 replies; 22+ messages in thread
From: Matthew Wilcox @ 2021-08-05 22:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Anna Schumaker, Trond Myklebust, Jeff Layton,
	Steve French, Dominique Martinet, Mike Marshall, Miklos Szeredi,
	Shyam Prasad N, linux-cachefs, linux-afs, open list:NFS, SUNRPC,
	AND...,
	CIFS, ceph-devel, v9fs-developer, devel, Linux-MM, linux-fsdevel,
	Linux Kernel Mailing List

On Thu, Aug 05, 2021 at 10:27:05AM -0700, Linus Torvalds wrote:
> On Thu, Aug 5, 2021 at 9:36 AM David Howells <dhowells@redhat.com> wrote:
> > Some network filesystems, however, currently keep track of which byte ranges
> > are modified within a dirty page (AFS does; NFS seems to also) and only write
> > out the modified data.
> 
> NFS definitely does. I haven't used NFS in two decades, but I worked
> on some of the code (read: I made nfs use the page cache both for
> reading and writing) back in my Transmeta days, because NFSv2 was the
> default filesystem setup back then.
> 
> See fs/nfs/write.c, although I have to admit that I don't recognize
> that code any more.
> 
> It's fairly important to be able to do streaming writes without having
> to read the old contents for some loads. And read-modify-write cycles
> are death for performance, so you really want to coalesce writes until
> you have the whole page.

I completely agree with you.  The context you're missing is that Dave
wants to do RMW twice.  He doesn't do the delaying SetPageUptodate dance.
If the write is less than the whole page, AFS, Ceph and anybody else
using netfs_write_begin() will first read the entire page in and mark
it Uptodate.

Then he wants to track which parts of the page are dirty (at byte
granularity) and send only those bytes to the server in a write request.
So it's worst of both worlds; first the client does an RMW, then the
server does an RMW (assuming the client's data is no longer in the
server's cache).

The NFS code moves the RMW from the client to the server, and that makes
a load of sense.

> That said, I suspect it's also *very* filesystem-specific, to the
> point where it might not be worth trying to do in some generic manner.

It certainly doesn't make sense for block filesystems.  Since they
can only do I/O on block boundaries, a sub-block write has to read in
the surrounding block, and once you're doing that, you might as well
read in the whole page.

Tracking sub-page dirty bits still makes sense.  It's on my to-do
list for iomap.

> [ goes off and looks. See "nfs_write_begin()" and friends in
> fs/nfs/file.c for some of the examples of these things, although it
> looks like the code is less aggressive about avoiding the
> read-modify-write case than I thought I remembered, and only does it
> for write-only opens ]

NFS is missing one trick; it could implement aops->is_partially_uptodate
and then it would be able to read back bytes that have already been
written by this client without writing back the dirty ranges and fetching
the page from the server.

Maybe this isn't an important optimisation.
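For illustration, the check behind ->is_partially_uptodate amounts to something like this (a userspace Python sketch, not the actual NFS code; the helper name and extent representation are invented here):

```python
# On a page that is not fully Uptodate, a read can still be served locally
# iff the requested byte range is entirely covered by data this client has
# already written.  Dirty extents are disjoint, sorted [start, end) pairs.

def partially_uptodate(extents, offset, length):
    """Return True if bytes [offset, offset+length) are all covered by
    the dirty extents, i.e. the read needs nothing from the server."""
    end = offset + length
    for s, e in sorted(extents):
        if s <= offset < e:
            offset = e           # covered up to e; keep walking forward
            if offset >= end:
                return True
    return offset >= end

if __name__ == "__main__":
    dirty = [(5, 20)]
    print(partially_uptodate(dirty, 8, 4))   # True: inside the written range
    print(partially_uptodate(dirty, 0, 8))   # False: bytes 0..4 never written
```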

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Canvassing for network filesystem write size vs page size
  2021-08-05 16:35     ` Canvassing for network filesystem write size vs page size David Howells
                         ` (2 preceding siblings ...)
  2021-08-05 18:50         ` Jeff Layton
@ 2021-08-05 23:47       ` Matthew Wilcox
  2021-08-06 13:44       ` David Howells
  4 siblings, 0 replies; 22+ messages in thread
From: Matthew Wilcox @ 2021-08-05 23:47 UTC (permalink / raw)
  To: David Howells
  Cc: Anna Schumaker, Trond Myklebust, Jeff Layton, Steve French,
	Dominique Martinet, Mike Marshall, Miklos Szeredi,
	Shyam Prasad N, Linus Torvalds, linux-cachefs, linux-afs,
	linux-nfs, linux-cifs, ceph-devel, v9fs-developer, devel,
	linux-mm, linux-fsdevel, linux-kernel

On Thu, Aug 05, 2021 at 05:35:33PM +0100, David Howells wrote:
> With Willy's upcoming folio changes, from a filesystem point of view, we're
> going to be looking at folios instead of pages, where:
> 
>  - a folio is a contiguous collection of pages;
> 
>  - each page in the folio might be a standard PAGE_SIZE page (4K or 64K, say) or
>    a huge page (say 2M each);

This is not a great way to explain folios.

If you're familiar with compound pages, a folio is a new type for
either a base page or the head page of a compound page; nothing more
and nothing less.

If you're not familiar with compound pages, a folio contains 2^n
contiguous pages.  They are treated as a single unit.

>  - a folio has one dirty flag and one writeback flag that applies to all
>    constituent pages;
> 
>  - a complete folio currently is limited to PMD_SIZE or order 8, but could
>    theoretically go up to about 2GiB before various integer fields have to be
>    modified (not to mention the memory allocator).

Filesystems should not make an assumption about this ... I suspect
the optimum page size scales with I/O bandwidth; taking PCI bandwidth
as a reasonable proxy, it's doubled five times in twenty years.

> Willy is arguing that network filesystems should, except in certain very
> special situations (eg. O_SYNC), only write whole folios (limited to EOF).

I did also say that the write could be limited by, eg, a byte-range
lease on the file.  If the client doesn't have permission to write
a byte range, then it doesn't need to write it back.
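In other words, writeback would clip each dirty extent against the leased ranges, roughly like this (illustrative userspace Python; function name and range representation invented for the example):

```python
# Only the intersection of each dirty extent with the byte ranges the
# client holds write permission on gets sent to the server.
# All ranges are [start, end) pairs.

def clip_to_lease(dirty, leases):
    out = []
    for ds, de in dirty:
        for ls, le in leases:
            s, e = max(ds, ls), min(de, le)
            if s < e:                    # non-empty intersection
                out.append((s, e))
    return sorted(out)

if __name__ == "__main__":
    # Dirty bytes 0..99, but the client only holds a lease on 40..59.
    print(clip_to_lease([(0, 100)], [(40, 60)]))   # prints [(40, 60)]
```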


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Canvassing for network filesystem write size vs page size
  2021-08-05 17:27         ` Linus Torvalds
                           ` (2 preceding siblings ...)
  (?)
@ 2021-08-06 13:42         ` David Howells
  2021-08-06 14:17           ` Matthew Wilcox
  2021-08-06 15:04           ` David Howells
  -1 siblings, 2 replies; 22+ messages in thread
From: David Howells @ 2021-08-06 13:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, Linus Torvalds, Anna Schumaker, Trond Myklebust,
	Jeff Layton, Steve French, Dominique Martinet, Mike Marshall,
	Miklos Szeredi, Shyam Prasad N, linux-cachefs, linux-afs,
	open list:NFS, SUNRPC, AND...,
	CIFS, ceph-devel, v9fs-developer, devel, Linux-MM, linux-fsdevel,
	Linux Kernel Mailing List

Matthew Wilcox <willy@infradead.org> wrote:

> > It's fairly important to be able to do streaming writes without having
> > to read the old contents for some loads. And read-modify-write cycles
> > are death for performance, so you really want to coalesce writes until
> > you have the whole page.
> 
> I completely agree with you.  The context you're missing is that Dave
> wants to do RMW twice.  He doesn't do the delaying SetPageUptodate dance.

Actually, I do the delaying of SetPageUptodate in the new write helpers that
I'm working on - at least to some extent.  For a write of any particular size
(which may be more than a page), I only read the first and last pages affected
if they're not completely changed by the write.  Note that I have my own
version of generic_perform_write() that allows me to eliminate write_begin and
write_end for any filesystem using it.

Keeping track of which regions are dirty allows merging of contiguous dirty
regions.
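In rough userspace terms, that bookkeeping looks something like the following (Python purely for illustration; the real code obviously isn't structured this way):

```python
# Per-file dirty byte-range bookkeeping: each write records a [start, end)
# extent, and contiguous or overlapping extents are merged so writeback
# can issue one RPC per extent.

def record_write(extents, start, length):
    """Insert [start, start+length) into a sorted list of disjoint
    extents, merging any extents that touch or overlap the new range."""
    end = start + length
    merged = []
    for s, e in extents:
        if e < start or s > end:         # strictly disjoint: keep as-is
            merged.append((s, e))
        else:                            # touching/overlapping: absorb
            start, end = min(s, start), max(e, end)
    merged.append((start, end))
    merged.sort()
    return merged

if __name__ == "__main__":
    # Writes at bytes 5..7, then 8..17, then 18..19 coalesce to one extent.
    ranges = []
    for off, n in [(5, 3), (8, 10), (18, 2)]:
        ranges = record_write(ranges, off, n)
    print(ranges)   # prints [(5, 20)]
```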

It has occurred to me that I don't actually need the pages to be uptodate and
completely filled out.  I'm tracking which bits are dirty - I could defer
reading the missing bits till someone wants to read or mmap.

But that kind of screws with local caching.  The local cache might need to
track the missing bits, and we are likely to be using blocks larger than a
page.

Basically, there are a lot of scenarios where not having fully populated pages
sucks.  And for streaming writes, wouldn't it be better if you used DIO
writes?

> If the write is less than the whole page, AFS, Ceph and anybody else
> using netfs_write_begin() will first read the entire page in and mark
> it Uptodate.

Indeed - but that function is set to be replaced.  What you're missing is that
if someone then tries to read the partially modified page, you may have to do
two reads from the server.

> Then he wants to track which parts of the page are dirty (at byte
> granularity) and send only those bytes to the server in a write request.

Yes.  Because other constraints may apply, for example the handling of
conflicting third-party writes.  The question here is how much we care about
that - and that's why I'm trying to write back only what's changed where
possible.

That said, if content encryption is thrown into the mix, the minimum we can
write back is whatever the size of the blocks on which encryption is
performed, so maybe we shouldn't care.

Add to that reconnection resolution after disconnected operation, where it might
be handy to have a list of what changed in a file.

> So it's worst of both worlds; first the client does an RMW, then the
> server does an RMW (assuming the client's data is no longer in the
> server's cache).

Actually, it's not necessarily as bad as you make out.  You have to compare the
server-side RMW with the cost of setting up a read or a write operation.

And then there's this scenario:  Imagine I'm going to modify the middle of a
page which doesn't yet exist.  I read the bit at the beginning and the bit at
the end and then try to fill the middle, but now get an EFAULT error.  I'm
going to have to do *three* reads if someone wants to read the page.

> The NFS code moves the RMW from the client to the server, and that makes
> a load of sense.

No, it very much depends.  It might suck if you have the folio partly cached
locally in fscache, and it doesn't work if you have content encryption and
would suck if you're doing disconnected operation.

I presume you're advocating that the change is immediately written to the
server, and then you read it back from the server?

> > That said, I suspect it's also *very* filesystem-specific, to the
> > point where it might not be worth trying to do in some generic manner.
> 
> It certainly doesn't make sense for block filesystems.  Since they
> can only do I/O on block boundaries, a sub-block write has to read in
> the surrounding block, and once you're doing that, you might as well
> read in the whole page.

I'm not trying to do this for block filesystems!  However, a block filesystem
- or even a blockdev - might be involved in terms of the local cache.

> Tracking sub-page dirty bits still makes sense.  It's on my to-do
> list for iomap.

/me blinks

"bits" as in parts of a page or "bits" as in the PG_dirty bits on the pages
contributing to a folio?

> > [ goes off and looks. See "nfs_write_begin()" and friends in
> > fs/nfs/file.c for some of the examples of these things, although it
> > looks like the code is less aggressive about avoiding the
> > read-modify-write case than I thought I remembered, and only does it
> > for write-only opens ]
> 
> NFS is missing one trick; it could implement aops->is_partially_uptodate
> and then it would be able to read back bytes that have already been
> written by this client without writing back the dirty ranges and fetching
> the page from the server.

As mentioned above, I have been considering the possibility of keeping track
of partially dirty non-uptodate pages.  Jeff and I have been discussing that
we might want support for explicit RMW anyway for various reasons (e.g. doing
DIO that's not crypto-block aligned,
remote-invalidation/reconnection-resolution handling).

David


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Canvassing for network filesystem write size vs page size
  2021-08-05 16:35     ` Canvassing for network filesystem write size vs page size David Howells
                         ` (3 preceding siblings ...)
  2021-08-05 23:47       ` Matthew Wilcox
@ 2021-08-06 13:44       ` David Howells
  4 siblings, 0 replies; 22+ messages in thread
From: David Howells @ 2021-08-06 13:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, Anna Schumaker, Trond Myklebust, Jeff Layton,
	Steve French, Dominique Martinet, Mike Marshall, Miklos Szeredi,
	Shyam Prasad N, Linus Torvalds, linux-cachefs, linux-afs,
	linux-nfs, linux-cifs, ceph-devel, v9fs-developer, devel,
	linux-mm, linux-fsdevel, linux-kernel

Matthew Wilcox <willy@infradead.org> wrote:

> Filesystems should not make an assumption about this ... I suspect
> the optimum page size scales with I/O bandwidth; taking PCI bandwidth
> as a reasonable proxy, it's doubled five times in twenty years.

There are a lot more factors than you make out.  Local caching, content
crypto, transport crypto, cost of setting up RPC calls, compounding calls to
multiple servers.

David


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Canvassing for network filesystem write size vs page size
  2021-08-06 13:42         ` David Howells
@ 2021-08-06 14:17           ` Matthew Wilcox
  2021-08-06 15:04           ` David Howells
  1 sibling, 0 replies; 22+ messages in thread
From: Matthew Wilcox @ 2021-08-06 14:17 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Anna Schumaker, Trond Myklebust, Jeff Layton,
	Steve French, Dominique Martinet, Mike Marshall, Miklos Szeredi,
	Shyam Prasad N, linux-cachefs, linux-afs, open list:NFS, SUNRPC,
	AND...,
	CIFS, ceph-devel, v9fs-developer, devel, Linux-MM, linux-fsdevel,
	Linux Kernel Mailing List

On Fri, Aug 06, 2021 at 02:42:37PM +0100, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > > It's fairly important to be able to do streaming writes without having
> > > to read the old contents for some loads. And read-modify-write cycles
> > > are death for performance, so you really want to coalesce writes until
> > > you have the whole page.
> > 
> > I completely agree with you.  The context you're missing is that Dave
> > wants to do RMW twice.  He doesn't do the delaying SetPageUptodate dance.
> 
> Actually, I do the delaying of SetPageUptodate in the new write helpers that
> I'm working on - at least to some extent.  For a write of any particular size
> (which may be more than a page), I only read the first and last pages affected
> if they're not completely changed by the write.  Note that I have my own
> version of generic_perform_write() that allows me to eliminate write_begin and
> write_end for any filesystem using it.

No, that is very much not the same thing.  Look at what NFS does, like
Linus said.  Consider this test program:

	fd = open();
	lseek(fd, 5, SEEK_SET);
	write(fd, buf, 3);
	write(fd, buf2, 10);
	write(fd, buf3, 2);
	close(fd);

You're going to do an RMW.  NFS keeps track of which bytes are dirty,
and writes only those bytes to the server (when that page is eventually
written-back).  So yes, it's using the page cache, but it's not doing
an unnecessary read from the server.

> It has occurred to me that I don't actually need the pages to be uptodate and
> completely filled out.  I'm tracking which bits are dirty - I could defer
> reading the missing bits till someone wants to read or mmap.
> 
> But that kind of screws with local caching.  The local cache might need to
> track the missing bits, and we are likely to be using blocks larger than a
> page.

There's nothing to cache.  Pages which are !Uptodate aren't going to get
locally cached.

> Basically, there are a lot of scenarios where not having fully populated pages
> sucks.  And for streaming writes, wouldn't it be better if you used DIO
> writes?

DIO can't do sub-512-byte writes.

> > If the write is less than the whole page, AFS, Ceph and anybody else
> > using netfs_write_begin() will first read the entire page in and mark
> > it Uptodate.
> 
> Indeed - but that function is set to be replaced.  What you're missing is that
> if someone then tries to read the partially modified page, you may have to do
> two reads from the server.

NFS doesn't.  It writes back the dirty data from the page and then
does a single read of the entire page.  And as I said later on, using
->is_partially_uptodate can avoid that for some cases.

> > Then he wants to track which parts of the page are dirty (at byte
> > granularity) and send only those bytes to the server in a write request.
> 
> Yes.  Because other constraints may apply, for example the handling of
> conflicting third-party writes.  The question here is how much we care about
> that - and that's why I'm trying to write back only what's changed where
> possible.

If you care about conflicting writes from different clients, you really
need to establish a cache ownership model.  Or treat the page-cache as
write-through.

> > > That said, I suspect it's also *very* filesystem-specific, to the
> > > point where it might not be worth trying to do in some generic manner.
> > 
> > It certainly doesn't make sense for block filesystems.  Since they
> > can only do I/O on block boundaries, a sub-block write has to read in
> > the surrounding block, and once you're doing that, you might as well
> > read in the whole page.
> 
> I'm not trying to do this for block filesystems!  However, a block filesystem
> - or even a blockdev - might be involved in terms of the local cache.

You might not be trying to do anything for block filesystems, but we
should think about what makes sense for block filesystems as well as
network filesystems.

> > Tracking sub-page dirty bits still makes sense.  It's on my to-do
> > list for iomap.
> 
> /me blinks
> 
> "bits" as in parts of a page or "bits" as in the PG_dirty bits on the pages
> contributing to a folio?

Perhaps I should have said "Tracking dirtiness on a sub-page basis".
Right now, that looks like a block bitmap, but maybe it should be a
range-based data structure.
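The block-bitmap version is cheap and fixed-size, something like this (userspace Python sketch with an invented helper, just to show the shape of it):

```python
# Divide each page into fixed-size sub-blocks and keep one dirty bit per
# block, setting every bit a byte range touches.  Coarser than byte
# extents, but the state is constant-size per page.

def mark_dirty(bitmap, block_size, offset, length):
    """Set the dirty bits for every sub-block that [offset, offset+length)
    touches; the bitmap is just an int used as a bit array."""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    for b in range(first, last + 1):
        bitmap |= 1 << b
    return bitmap

if __name__ == "__main__":
    # 4K page, 512-byte blocks: dirtying bytes 500..1500 touches blocks 0-2.
    bits = mark_dirty(0, 512, 500, 1001)
    print(bin(bits))   # prints 0b111
```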

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Canvassing for network filesystem write size vs page size
  2021-08-06 13:42         ` David Howells
  2021-08-06 14:17           ` Matthew Wilcox
@ 2021-08-06 15:04           ` David Howells
  1 sibling, 0 replies; 22+ messages in thread
From: David Howells @ 2021-08-06 15:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, Linus Torvalds, Anna Schumaker, Trond Myklebust,
	Jeff Layton, Steve French, Dominique Martinet, Mike Marshall,
	Miklos Szeredi, Shyam Prasad N, linux-cachefs, linux-afs,
	open list:NFS, SUNRPC, AND...,
	CIFS, ceph-devel, v9fs-developer, devel, Linux-MM, linux-fsdevel,
	Linux Kernel Mailing List

Matthew Wilcox <willy@infradead.org> wrote:

> No, that is very much not the same thing.  Look at what NFS does, like
> Linus said.  Consider this test program:
> 
> 	fd = open();
> 	lseek(fd, 5, SEEK_SET);
> 	write(fd, buf, 3);
> 	write(fd, buf2, 10);
> 	write(fd, buf3, 2);
> 	close(fd);

Yes, I get that.  I can do that when there isn't a local cache or content
encryption.

Note that, currently, if the pages (or cache blocks) being read/modified are
beyond the EOF at the point when the file is opened, truncated down or last
subject to 3rd-party invalidation, I don't go to the server at all.

> > But that kind of screws with local caching.  The local cache might need to
> > track the missing bits, and we are likely to be using blocks larger than a
> > page.
> 
> There's nothing to cache.  Pages which are !Uptodate aren't going to get
> locally cached.

Eh?  Of course there is.  You've just written some data.  That needs to get
copied to the cache as well as the server if that file is supposed to be being
cached (for filesystems that support local caching of files open for writing,
which AFS does).

> > Basically, there are a lot of scenarios where not having fully populated
> > pages sucks.  And for streaming writes, wouldn't it be better if you used
> > DIO writes?
> 
> DIO can't do sub-512-byte writes.

Yes it can - and it works for my AFS client, at least with the patches in my
fscache-iter-2 branch.  The 512-byte restriction mainly applies to block
storage devices we're doing DMA to - but we're typically not doing direct DMA
to a block storage device when talking to a network filesystem.

For AFS, at least, I can just make one big FetchData/StoreData RPC that
reads/writes the entire DIO request in a single op; for other filesystems
(NFS, ceph for example), it needs breaking up into a sequence of RPCs, but
there's no particular reason that I know of that requires it to be 512-byte
aligned on any of these.

Things get more interesting if you're doing DIO to a content-encrypted file
because the block size may be 4096 or even a lot larger - in which case we
would have to do a local read-modify-write (RMW) to handle misaligned writes,
but that presents no particular difficulty.

> You might not be trying to do anything for block filesystems, but we
> should think about what makes sense for block filesystems as well as
> network filesystems.

Whilst that's a good principle, they have very different characteristics that
might make that difficult.

David



Thread overview: 22+ messages
2021-08-05 10:19 Could it be made possible to offer "supplementary" data to a DIO write ? David Howells
2021-08-05 12:37 ` Matthew Wilcox
2021-08-05 13:07 ` David Howells
2021-08-05 13:35   ` Matthew Wilcox
2021-08-05 14:38   ` David Howells
2021-08-05 15:06     ` Matthew Wilcox
2021-08-05 15:38     ` David Howells
2021-08-05 16:35     ` Canvassing for network filesystem write size vs page size David Howells
2021-08-05 17:27       ` Linus Torvalds
2021-08-05 17:43         ` Trond Myklebust
2021-08-05 22:11         ` Matthew Wilcox
2021-08-06 13:42         ` David Howells
2021-08-06 14:17           ` Matthew Wilcox
2021-08-06 15:04           ` David Howells
2021-08-05 17:52       ` Adam Borowski
2021-08-05 18:50       ` Jeff Layton
2021-08-05 23:47       ` Matthew Wilcox
2021-08-06 13:44       ` David Howells
2021-08-05 17:45     ` Could it be made possible to offer "supplementary" data to a DIO write ? Adam Borowski