* Dirty bits and sync writes
From: Matthew Wilcox @ 2021-08-03 15:28 UTC
  To: linux-fsdevel
  Cc: Zhengyuan Liu, yukuai3, Christoph Hellwig, Dave Chinner,
	David Howells, linux-xfs

Back in July 2020, Zhengyuan Liu noticed a performance problem with
XFS which I've realised is going to be a problem for multi-page folios.
Much of this is iomap specific, but solution 3 is not, and perhaps the
problems pertain to other filesystems too.

The original email is here:

https://lore.kernel.org/linux-xfs/CAOOPZo45E+hVAo9S_2psMJQzrzwmVKo_WjWOM7Zwhm_CS0J3iA@mail.gmail.com/

but to summarise, the workload involves doing 4kB sync writes on a
machine with a 64kB page size.  That ends up amplifying the number of
bytes written by a factor of 16 because iomap doesn't track dirtiness
on a sub-page basis.

Unfortunately, that factor grows linearly with the page size, so the
total volume written per page grows as O(n^2).  That is, if you have the
very reasonable workload:

dd if=/dev/zero bs=4k conv=fdatasync

If you have a 64k page, you'll write it 16 times.  If you have a 128k
page, you'll write it 32 times.  A 256k page gets written 64 times.
I don't currently have code to create multi-page folios on writes, but
if there's already a multi-page folio in the cache, this will happen.
And we should create multi-page folios on writes, because it's more
efficient for the non-sync-write case.
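
To put numbers on that quadratic growth, here is a small stand-alone C
calculation.  It assumes the worst case described above: every 4kB write
is individually flushed (O_DSYNC, or an fdatasync after each write) and,
without sub-page dirty tracking, each flush writes back the whole page.

#include <stdio.h>

int main(void)
{
	const unsigned long chunk = 4096;	/* size of each sync write */
	unsigned long page;

	for (page = 65536; page <= 262144; page *= 2) {
		unsigned long writebacks = page / chunk;

		/* bytes hitting storage while filling one page */
		printf("%3luk page: %2lu writebacks, %5lu KiB written per %3lu KiB of data\n",
		       page >> 10, writebacks, writebacks * page >> 10,
		       page >> 10);
	}
	return 0;
}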

Solution 1: Add an array of dirty bits to the iomap_page
data structure.  This patch already exists; would need
to be adjusted slightly to apply to the current tree.
https://lore.kernel.org/linux-xfs/7fb4bb5a-adc7-5914-3aae-179dd8f3adb1@huawei.com/

Solution 2a: Replace the array of uptodate bits with an array of
dirty bits.  It is not often useful to know which parts of the page are
uptodate; usually the entire page is uptodate.  We can actually use the
dirty bits for the same purpose as uptodate bits; if a block is dirty, it
is definitely uptodate.  If a block is !dirty, and the page is !uptodate,
the block may or may not be uptodate, but it can be safely re-read from
storage without losing any data.
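
As a rough sketch of that rule (illustrative only: the helper name and
the dirty bitmap argument are made up, this is not the current iomap
code), the read-before-partial-write decision under solution 2a could
look like:

static bool iomap_block_needs_read(struct folio *folio,
				   const unsigned long *dirty_bits,
				   unsigned int block)
{
	if (folio_test_uptodate(folio))
		return false;		/* whole folio already valid */
	if (test_bit(block, dirty_bits))
		return false;		/* dirty implies uptodate */
	return true;			/* contents unknown; safe to re-read */
}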

Solution 2b: Lose the concept of partially uptodate pages.  If we're
going to write to a partial page, just bring the entire page uptodate
first, then write to it.  It's not clear to me that partially-uptodate
pages are really useful.  I don't know of any network filesystems that
support partially-uptodate pages, for example.  It seems to have been
something we did for buffer_head based filesystems "because we could"
rather than finding a workload that actually cares.

Solution 3: Special-case sync writes.  Usually we want the page cache
to absorb a lot of writes and then write them back later.  A sync write
says to treat the page cache as writethrough.  We currently handle
this by writing into the page cache, marking the page(s) as dirty,
looking up the page(s) which are dirty in the range, starting writeback
on each of them, then waiting for writeback to finish on each of them.
See generic_write_sync() for how that happens.  What we could do is,
with the page locked in __iomap_write_end(), if the page is !dirty and
!writeback, mark the page as writeback and start writeback of just the
bytes which have been touched by this write.  generic_write_sync() will
continue to work correctly; it won't start writeback on the page(s)
because they won't be dirty.  It will wait for the writeback to end,
and then proceed to write back the inode if it was an O_SYNC write rather
than an O_DSYNC write.
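
A hedged sketch of that path (not the real __iomap_write_end(); the
submit_partial_writeback() helper is hypothetical and error handling is
omitted):

static void iomap_write_end_writethrough(struct folio *folio, loff_t pos,
					 size_t len, bool sync_write)
{
	if (sync_write && !folio_test_dirty(folio) &&
	    !folio_test_writeback(folio)) {
		/*
		 * Treat the cache as writethrough: the folio never becomes
		 * dirty, so generic_write_sync() only has to wait for the
		 * writeback we start here.  I/O completion is expected to
		 * call folio_end_writeback().
		 */
		folio_start_writeback(folio);
		submit_partial_writeback(folio, pos, len); /* hypothetical */
	} else {
		folio_mark_dirty(folio);
	}
}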

Solutions 2a and 2b do have a bit of a problem with read errors.  If we
care about being able to recover the parts of the file which are outside
a block with read errors, by reading through the page cache, then we
need to have per-block error bits or uptodate bits.  But I don't think
this is terribly interesting; I think we can afford (at the current
low rate of storage errors) to tell users to use O_DIRECT to recover
the still-readable blocks from files.  Again, network filesystems don't
support sub-page uptodate/error bits.

(it occurs to me that solution 3 actually allows us to do IOs at storage
block size instead of filesystem block size, potentially reducing write
amplification even more, although we will need to be a bit careful if
we're doing a CoW.)


* Re: Dirty bits and sync writes
From: Christoph Hellwig @ 2021-08-09 14:48 UTC
  To: Matthew Wilcox
  Cc: linux-fsdevel, Zhengyuan Liu, yukuai3, Christoph Hellwig,
	Dave Chinner, David Howells, linux-xfs

On Tue, Aug 03, 2021 at 04:28:14PM +0100, Matthew Wilcox wrote:
> Solution 1: Add an array of dirty bits to the iomap_page
> data structure.  This patch already exists; would need
> to be adjusted slightly to apply to the current tree.
> https://lore.kernel.org/linux-xfs/7fb4bb5a-adc7-5914-3aae-179dd8f3adb1@huawei.com/

> Solution 2a: Replace the array of uptodate bits with an array of
> dirty bits.  It is not often useful to know which parts of the page are
> uptodate; usually the entire page is uptodate.  We can actually use the
> dirty bits for the same purpose as uptodate bits; if a block is dirty, it
> is definitely uptodate.  If a block is !dirty, and the page is !uptodate,
> the block may or may not be uptodate, but it can be safely re-read from
> storage without losing any data.

1 or 2a seems like something we should do once we have large folio
support.


> Solution 2b: Lose the concept of partially uptodate pages.  If we're
> going to write to a partial page, just bring the entire page uptodate
> first, then write to it.  It's not clear to me that partially-uptodate
> pages are really useful.  I don't know of any network filesystems that
> support partially-uptodate pages, for example.  It seems to have been
> something we did for buffer_head based filesystems "because we could"
> rather than finding a workload that actually cares.

The uptodate bit is important for the use case of a smaller than page
size buffered write into a page that hasn't been read in already, which
is fairly common for things like log writes.  So I'd hate to lose this
optimization.

> (it occurs to me that solution 3 actually allows us to do IOs at storage
> block size instead of filesystem block size, potentially reducing write
> amplification even more, although we will need to be a bit careful if
> we're doing a CoW.)

Number 3 might be a nice optimization.  An even better version would
be a disk format change to just log those updates in the log and
otherwise use the normal dirty mechanism.  I once had a crude prototype
for that.


* Re: Dirty bits and sync writes
From: Matthew Wilcox @ 2021-08-09 15:30 UTC
  To: Christoph Hellwig
  Cc: linux-fsdevel, Zhengyuan Liu, yukuai3, Dave Chinner,
	David Howells, linux-xfs

On Mon, Aug 09, 2021 at 03:48:56PM +0100, Christoph Hellwig wrote:
> On Tue, Aug 03, 2021 at 04:28:14PM +0100, Matthew Wilcox wrote:
> > Solution 1: Add an array of dirty bits to the iomap_page
> > data structure.  This patch already exists; would need
> > to be adjusted slightly to apply to the current tree.
> > https://lore.kernel.org/linux-xfs/7fb4bb5a-adc7-5914-3aae-179dd8f3adb1@huawei.com/
> 
> > Solution 2a: Replace the array of uptodate bits with an array of
> > dirty bits.  It is not often useful to know which parts of the page are
> > uptodate; usually the entire page is uptodate.  We can actually use the
> > dirty bits for the same purpose as uptodate bits; if a block is dirty, it
> > is definitely uptodate.  If a block is !dirty, and the page is !uptodate,
> > the block may or may not be uptodate, but it can be safely re-read from
> > storage without losing any data.
> 
> 1 or 2a seems like something we should do once we have large folio
> support.
> 
> 
> > Solution 2b: Lose the concept of partially uptodate pages.  If we're
> > going to write to a partial page, just bring the entire page uptodate
> > first, then write to it.  It's not clear to me that partially-uptodate
> > pages are really useful.  I don't know of any network filesystems that
> > support partially-uptodate pages, for example.  It seems to have been
> > something we did for buffer_head based filesystems "because we could"
> > rather than finding a workload that actually cares.
> 
> The uptodate bit is important for the use case of a smaller than page
> size buffered write into a page that hasn't been read in already, which
> is fairly common for things like log writes.  So I'd hate to lose this
> optimization.
> 
> > (it occurs to me that solution 3 actually allows us to do IOs at storage
> > block size instead of filesystem block size, potentially reducing write
> > amplification even more, although we will need to be a bit careful if
> > we're doing a CoW.)
> 
> Number 3 might be a nice optimization.  An even better version would
> be a disk format change to just log those updates in the log and
> otherwise use the normal dirty mechanism.  I once had a crude prototype
> for that.

That's a bit beyond my scope at this point.  I'm currently working on
write-through.  Once I have that working, I think the next step is:

 - Replace the ->uptodate array with a ->dirty array
 - If the entire page is Uptodate, drop the iomap_page.  That means that
   writebacks will write back the entire folio, not just the dirty
   pieces.
 - If doing a partial page write
   - If the write is block-aligned (offset & length), leave the page
     !Uptodate and mark the dirty blocks
   - Otherwise bring the entire page Uptodate first, then mark it dirty

To take an example of a 512-byte block size file accepting a 520 byte
write at offset 500, we currently submit two reads, one for bytes 0-511
and the second for bytes 512-1023.  We're better off submitting a read for
bytes 0-4095 and then overwriting the entire thing.

But it's still better to do no reads at all if someone submits a write
for bytes 512-1023, or 512-N where N is past EOF.  And I'd preserve
that behaviour.
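
Purely as an illustration of that plan (invented names, not kernel code),
the write_begin-time decision would be roughly:

static int iomap_write_begin_sketch(struct folio *folio, loff_t pos,
				    size_t len, unsigned int block_size)
{
	size_t from = offset_in_folio(folio, pos);

	if (folio_test_uptodate(folio))
		return 0;	/* nothing to read */

	if (IS_ALIGNED(from, block_size) && IS_ALIGNED(len, block_size))
		return 0;	/* leave !Uptodate; only mark written blocks dirty */

	/*
	 * Unaligned partial write: one read for the whole folio, then
	 * overwrite in place.  (A real implementation would also skip
	 * the read for writes starting at or beyond EOF, as noted above.)
	 */
	return read_entire_folio(folio);	/* hypothetical helper */
}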



* Re: Dirty bits and sync writes
From: Jeff Layton @ 2021-08-11 19:04 UTC
  To: Matthew Wilcox, Christoph Hellwig
  Cc: linux-fsdevel, Zhengyuan Liu, yukuai3, Dave Chinner,
	David Howells, linux-xfs

On Mon, 2021-08-09 at 16:30 +0100, Matthew Wilcox wrote:
> On Mon, Aug 09, 2021 at 03:48:56PM +0100, Christoph Hellwig wrote:
> > On Tue, Aug 03, 2021 at 04:28:14PM +0100, Matthew Wilcox wrote:
> > > Solution 1: Add an array of dirty bits to the iomap_page
> > > data structure.  This patch already exists; would need
> > > to be adjusted slightly to apply to the current tree.
> > > https://lore.kernel.org/linux-xfs/7fb4bb5a-adc7-5914-3aae-179dd8f3adb1@huawei.com/
> > 
> > > Solution 2a: Replace the array of uptodate bits with an array of
> > > dirty bits.  It is not often useful to know which parts of the page are
> > > uptodate; usually the entire page is uptodate.  We can actually use the
> > > dirty bits for the same purpose as uptodate bits; if a block is dirty, it
> > > is definitely uptodate.  If a block is !dirty, and the page is !uptodate,
> > > the block may or may not be uptodate, but it can be safely re-read from
> > > storage without losing any data.
> > 
> > 1 or 2a seems like something we should do once we have large folio
> > support.
> > 
> > 
> > > Solution 2b: Lose the concept of partially uptodate pages.  If we're
> > > going to write to a partial page, just bring the entire page uptodate
> > > first, then write to it.  It's not clear to me that partially-uptodate
> > > pages are really useful.  I don't know of any network filesystems that
> > > support partially-uptodate pages, for example.  It seems to have been
> > > something we did for buffer_head based filesystems "because we could"
> > > rather than finding a workload that actually cares.
> > 

I may be wrong, but I thought NFS actually could deal with partially
uptodate pages. In some cases it can opt to just do a write to a page
w/o reading first and flush just that section when the time comes.

I think the heuristics are in nfs_want_read_modify_write(). #3 may be a
better way though.

> > The uptodate bit is important for the use case of a smaller than page
> > size buffered write into a page that hasn't been read in already, which
> > is fairly common for things like log writes.  So I'd hate to lose this
> > optimization.
> > 
> > > (it occurs to me that solution 3 actually allows us to do IOs at storage
> > > block size instead of filesystem block size, potentially reducing write
> > > amplification even more, although we will need to be a bit careful if
> > > we're doing a CoW.)
> > 
> > Number 3 might be a nice optimization.  An even better version would
> > be a disk format change to just log those updates in the log and
> > otherwise use the normal dirty mechanism.  I once had a crude prototype
> > for that.
> 
> That's a bit beyond my scope at this point.  I'm currently working on
> write-through.  Once I have that working, I think the next step is:
> 
>  - Replace the ->uptodate array with a ->dirty array
>  - If the entire page is Uptodate, drop the iomap_page.  That means that
>    writebacks will write back the entire folio, not just the dirty
>    pieces.
>  - If doing a partial page write
>    - If the write is block-aligned (offset & length), leave the page
>      !Uptodate and mark the dirty blocks
>    - Otherwise bring the entire page Uptodate first, then mark it dirty
> 
> To take an example of a 512-byte block size file accepting a 520 byte
> write at offset 500, we currently submit two reads, one for bytes 0-511
> and the second for bytes 512-1023.  We're better off submitting a read for
> bytes 0-4095 and then overwriting the entire thing.
> 
> But it's still better to do no reads at all if someone submits a write
> for bytes 512-1023, or 512-N where N is past EOF.  And I'd preserve
> that behaviour.
> 

I like this idea too.

I'd also point out that both cifs and ceph (at least) can read and write
"around" the cache in some cases (using non-pagecache pages) when they
can't get the proper oplock/lease/caps from the server. Both of them
have completely separate "uncached" codepaths, that are distinct from
the O_DIRECT cases.

This scheme could potentially be a saner method of dealing with those
situations too.
-- 
Jeff Layton <jlayton@kernel.org>



* Re: Dirty bits and sync writes
From: Matthew Wilcox @ 2021-08-11 19:42 UTC
  To: Jeff Layton
  Cc: Christoph Hellwig, linux-fsdevel, Zhengyuan Liu, yukuai3,
	Dave Chinner, David Howells, linux-xfs

On Wed, Aug 11, 2021 at 03:04:07PM -0400, Jeff Layton wrote:
> On Mon, 2021-08-09 at 16:30 +0100, Matthew Wilcox wrote:
> > On Mon, Aug 09, 2021 at 03:48:56PM +0100, Christoph Hellwig wrote:
> > > On Tue, Aug 03, 2021 at 04:28:14PM +0100, Matthew Wilcox wrote:
> > > > Solution 1: Add an array of dirty bits to the iomap_page
> > > > data structure.  This patch already exists; would need
> > > > to be adjusted slightly to apply to the current tree.
> > > > https://lore.kernel.org/linux-xfs/7fb4bb5a-adc7-5914-3aae-179dd8f3adb1@huawei.com/
> > > 
> > > > Solution 2a: Replace the array of uptodate bits with an array of
> > > > dirty bits.  It is not often useful to know which parts of the page are
> > > > uptodate; usually the entire page is uptodate.  We can actually use the
> > > > dirty bits for the same purpose as uptodate bits; if a block is dirty, it
> > > > is definitely uptodate.  If a block is !dirty, and the page is !uptodate,
> > > > the block may or may not be uptodate, but it can be safely re-read from
> > > > storage without losing any data.
> > > 
> > > 1 or 2a seems like something we should do once we have large folio
> > > support.
> > > 
> > > 
> > > > Solution 2b: Lose the concept of partially uptodate pages.  If we're
> > > > going to write to a partial page, just bring the entire page uptodate
> > > > first, then write to it.  It's not clear to me that partially-uptodate
> > > > pages are really useful.  I don't know of any network filesystems that
> > > > support partially-uptodate pages, for example.  It seems to have been
> > > > something we did for buffer_head based filesystems "because we could"
> > > > rather than finding a workload that actually cares.
> > > 
> 
> I may be wrong, but I thought NFS actually could deal with partially
> uptodate pages. In some cases it can opt to just do a write to a page
> w/o reading first and flush just that section when the time comes.
> 
> I think the heuristics are in nfs_want_read_modify_write(). #3 may be a
> better way though.

Yes; I was talking specifically about iomap here.  I like what NFS does;
it makes a lot of sense to me.  The only thing I'd like to change for
NFS is to add an ->is_partially_uptodate implementation.  But that
only matters if somebody calls read() on some bytes they just wrote.
And I don't know that that's a particularly common or interesting
thing to do.  I suppose if there are two processes with one writing
to the file and the other reading from it, we might get that.

> > > The uptodate bit is important for the use case of a smaller than page
> > > size buffered write into a page that hasn't been read in already, which
> > > is fairly common for things like log writes.  So I'd hate to lose this
> > > optimization.
> > > 
> > > > (it occurs to me that solution 3 actually allows us to do IOs at storage
> > > > block size instead of filesystem block size, potentially reducing write
> > > > amplification even more, although we will need to be a bit careful if
> > > > we're doing a CoW.)
> > > 
> > > Number 3 might be a nice optimization.  An even better version would
> > > be a disk format change to just log those updates in the log and
> > > otherwise use the normal dirty mechanism.  I once had a crude prototype
> > > for that.
> > 
> > That's a bit beyond my scope at this point.  I'm currently working on
> > write-through.  Once I have that working, I think the next step is:
> > 
> >  - Replace the ->uptodate array with a ->dirty array
> >  - If the entire page is Uptodate, drop the iomap_page.  That means that
> >    writebacks will write back the entire folio, not just the dirty
> >    pieces.
> >  - If doing a partial page write
> >    - If the write is block-aligned (offset & length), leave the page
> >      !Uptodate and mark the dirty blocks
> >    - Otherwise bring the entire page Uptodate first, then mark it dirty
> > 
> > To take an example of a 512-byte block size file accepting a 520 byte
> > write at offset 500, we currently submit two reads, one for bytes 0-511
> > and the second for bytes 512-1023.  We're better off submitting a read for
> > bytes 0-4095 and then overwriting the entire thing.
> > 
> > But it's still better to do no reads at all if someone submits a write
> > for bytes 512-1023, or 512-N where N is past EOF.  And I'd preserve
> > that behaviour.
> 
> I like this idea too.
> 
> I'd also point out that both cifs and ceph (at least) can read and write
> "around" the cache in some cases (using non-pagecache pages) when they
> can't get the proper oplock/lease/caps from the server. Both of them
> have completely separate "uncached" codepaths, that are distinct from
> the O_DIRECT cases.
> 
> This scheme could potentially be a saner method of dealing with those
> situations too.

Possibly ... how do cifs & ceph handle writable mmap() when they can't
get the proper exclusive access from the server?

The major downside I can see is that if you have multiple writes to
the same page then the writes will be serialised.  But maybe they
already are by the scheme used by cifs & ceph?

(That could be worked around by tracking outstanding writes to the
server and only clearing the writeback bit once all writes have
completed)
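
A minimal sketch of that bookkeeping (illustrative only; the context
structure is invented):

struct sync_write_ctx {
	struct folio	*folio;
	atomic_t	outstanding;	/* in-flight partial writes */
};

static void sync_write_done(struct sync_write_ctx *ctx)
{
	/*
	 * End writeback only when the last outstanding write completes,
	 * so several sub-page writes to the same folio can overlap
	 * instead of being serialised.
	 */
	if (atomic_dec_and_test(&ctx->outstanding))
		folio_end_writeback(ctx->folio);
}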


* Re: Dirty bits and sync writes
From: Jeff Layton @ 2021-08-11 20:40 UTC
  To: Matthew Wilcox
  Cc: Christoph Hellwig, linux-fsdevel, Zhengyuan Liu, yukuai3,
	Dave Chinner, David Howells, linux-xfs

On Wed, 2021-08-11 at 20:42 +0100, Matthew Wilcox wrote:
> On Wed, Aug 11, 2021 at 03:04:07PM -0400, Jeff Layton wrote:
> > On Mon, 2021-08-09 at 16:30 +0100, Matthew Wilcox wrote:
> > > On Mon, Aug 09, 2021 at 03:48:56PM +0100, Christoph Hellwig wrote:
> > > > On Tue, Aug 03, 2021 at 04:28:14PM +0100, Matthew Wilcox wrote:
> > > > > Solution 1: Add an array of dirty bits to the iomap_page
> > > > > data structure.  This patch already exists; would need
> > > > > to be adjusted slightly to apply to the current tree.
> > > > > https://lore.kernel.org/linux-xfs/7fb4bb5a-adc7-5914-3aae-179dd8f3adb1@huawei.com/
> > > > 
> > > > > Solution 2a: Replace the array of uptodate bits with an array of
> > > > > dirty bits.  It is not often useful to know which parts of the page are
> > > > > uptodate; usually the entire page is uptodate.  We can actually use the
> > > > > dirty bits for the same purpose as uptodate bits; if a block is dirty, it
> > > > > is definitely uptodate.  If a block is !dirty, and the page is !uptodate,
> > > > > the block may or may not be uptodate, but it can be safely re-read from
> > > > > storage without losing any data.
> > > > 
> > > > 1 or 2a seems like something we should do once we have large folio
> > > > support.
> > > > 
> > > > 
> > > > > Solution 2b: Lose the concept of partially uptodate pages.  If we're
> > > > > going to write to a partial page, just bring the entire page uptodate
> > > > > first, then write to it.  It's not clear to me that partially-uptodate
> > > > > pages are really useful.  I don't know of any network filesystems that
> > > > > support partially-uptodate pages, for example.  It seems to have been
> > > > > something we did for buffer_head based filesystems "because we could"
> > > > > rather than finding a workload that actually cares.
> > > > 
> > 
> > I may be wrong, but I thought NFS actually could deal with partially
> > uptodate pages. In some cases it can opt to just do a write to a page
> > w/o reading first and flush just that section when the time comes.
> > 
> > I think the heuristics are in nfs_want_read_modify_write(). #3 may be a
> > better way though.
> 
> Yes; I was talking specifically about iomap here.  I like what NFS does;
> it makes a lot of sense to me.  The only thing I'd like to change for
> NFS is to add an ->is_partially_uptodate implementation.  But that
> only matters if somebody calls read() on some bytes they just wrote.
> And I don't know that that's a particularly common or interesting
> thing to do.  I suppose if there are two processes with one writing
> to the file and the other reading from it, we might get that.
> 

Typically if you have that sort of competing (complementary?) access
between tasks, then you probably want to just do page I/O. I think NFS
only does the partial page thing when the file is not open for read at
all. folios may change that calculation though.

> > > > The uptodate bit is important for the use case of a smaller than page
> > > > size buffered write into a page that hasn't been read in already, which
> > > > is fairly common for things like log writes.  So I'd hate to lose this
> > > > optimization.
> > > > 
> > > > > (it occurs to me that solution 3 actually allows us to do IOs at storage
> > > > > block size instead of filesystem block size, potentially reducing write
> > > > > amplification even more, although we will need to be a bit careful if
> > > > > we're doing a CoW.)
> > > > 
> > > > Number 3 might be a nice optimization.  An even better version would
> > > > be a disk format change to just log those updates in the log and
> > > > otherwise use the normal dirty mechanism.  I once had a crude prototype
> > > > for that.
> > > 
> > > That's a bit beyond my scope at this point.  I'm currently working on
> > > write-through.  Once I have that working, I think the next step is:
> > > 
> > >  - Replace the ->uptodate array with a ->dirty array
> > >  - If the entire page is Uptodate, drop the iomap_page.  That means that
> > >    writebacks will write back the entire folio, not just the dirty
> > >    pieces.
> > >  - If doing a partial page write
> > >    - If the write is block-aligned (offset & length), leave the page
> > >      !Uptodate and mark the dirty blocks
> > >    - Otherwise bring the entire page Uptodate first, then mark it dirty
> > > 
> > > To take an example of a 512-byte block size file accepting a 520 byte
> > > write at offset 500, we currently submit two reads, one for bytes 0-511
> > > and the second for bytes 512-1023.  We're better off submitting a read for
> > > bytes 0-4095 and then overwriting the entire thing.
> > > 
> > > But it's still better to do no reads at all if someone submits a write
> > > for bytes 512-1023, or 512-N where N is past EOF.  And I'd preserve
> > > that behaviour.
> > 
> > I like this idea too.
> > 
> > I'd also point out that both cifs and ceph (at least) can read and write
> > "around" the cache in some cases (using non-pagecache pages) when they
> > can't get the proper oplock/lease/caps from the server. Both of them
> > have completely separate "uncached" codepaths, that are distinct from
> > the O_DIRECT cases.
> > 
> > This scheme could potentially be a saner method of dealing with those
> > situations too.
> 
> Possibly ... how do cifs & ceph handle writable mmap() when they can't
> get the proper exclusive access from the server?
> 
> The major downside I can see is that if you have multiple writes to
> the same page then the writes will be serialised.  But maybe they
> already are by the scheme used by cifs & ceph?
> 
> (That could be worked around by tracking outstanding writes to the
> server and only clearing the writeback bit once all writes have
> completed)


They punt and use the pagecache when they don't have exclusive access.
It's not ideal. The whole uncached I/O concept more or less breaks down
with mmap in play, tbqh.

I don't think they end up serializing access in most of these
situations. They just allocate enough bounce buffer pages to do the
write and then issue it synchronously (or in some cases just use the
pages from userland). Ditto for reads.

Serializing access to the pages in this situation is probably fine,
though. These are expected to be slow, safe codepaths that are primarily
used when there is competing access to the same files by different
clients.
-- 
Jeff Layton <jlayton@kernel.org>

