Re: [LSF/MM/BPF TOPIC] untorn buffered writes

From: Dave Chinner <david@fromorbit.com>
To: Theodore Ts'o <tytso@mit.edu>
Cc: Matthew Wilcox <willy@infradead.org>,
	lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	linux-mm <linux-mm@kvack.org>
Subject: Re: [LSF/MM/BPF TOPIC] untorn buffered writes
Date: Thu, 29 Feb 2024 12:07:02 +1100	[thread overview]
Message-ID: <Zd/YtmTBUN7jFg4X@dread.disaster.area> (raw)
In-Reply-To: <20240228233354.GC177082@mit.edu>

On Wed, Feb 28, 2024 at 05:33:54PM -0600, Theodore Ts'o wrote:
> On Wed, Feb 28, 2024 at 02:11:06PM +0000, Matthew Wilcox wrote:
> > I'm not entirely sure that it does become a mess.  If our implementation
> > of this ensures that each write ends up in a single folio (even if the
> > entire folio is larger than the write), then we will have satisfied the
> > semantics of the flag.
> 
> What if we do a 32k write which spans two folios?  And what
> if the physical pages for those 32k in the buffer cache are not
> contiguous?  Are you going to have to join the two 16k folios
> together, or maybe two 8k folios and an 16k folio, and relocate pages
> to make a contiguous 32k folio when we do a buffered RWF_ATOMIC write
> of size 32k?

RWF_ATOMIC defines contraints that a 32kB write must be 32kB
aligned. So the only way a 32kB write would span two folios is if
a 16kB write had already been done in this space.

WE are already dealing with this problem for bs > ps with the min
order mapping constraint. We can deal with this easily by ensuring
that when we set the inode as supporting atomic writes. This already
ensures physical extent allocation alignment, we can also set the
mapping folio order at this time to ensure that we only allocate
RWF_ATOMIC compatible aligned/sized folios....

> > I think we'd be better off treating RWF_ATOMIC like it's a bs>PS device.

Which is why Willy says this...

> > That takes two somewhat special cases and makes them use the same code
> > paths, which probably means fewer bugs as both camps will be testing
> > the same code.
> 
> But for a bs > PS device, where the logical block size is greater than
> the page size, you don't need the RWF_ATOMIC flag at all.

Yes we do - hardware already supports REQ_ATOMIC sizes larger than
64kB filesystem blocks. i.e. RWF_ATOMIC is not restricted to 64kB
or any specific filesystem block size, and can always be larger than
the filesystem block size.

> All direct
> I/O writes *must* be a multiple of the logical sector size, and
> buffered writes, if they are smaller than the block size, *must* be
> handled as a read-modify-write, since you can't send writes to the
> device smaller than the logical sector size.

The filesystem will likely need to constrain minimum RWF_ATOMIC
sizes to a single filesystem block. That's the whole point of having
the statx interface - the application is going to have to query what
the min/max atomic write sizes supported are and adjust to those.
Applications will not be able to use 2kB RWF_ATOMIC writes on a 4kB
block size filesystem, and it's no different with larger filesystem
block sizes.

-Dave.

-- 
Dave Chinner
david@fromorbit.com