Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support

From: John Garry <john.g.garry@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>,
	hch@lst.de, viro@zeniv.linux.org.uk, brauner@kernel.org,
	dchinner@redhat.com, jack@suse.cz, chandan.babu@oracle.com,
	martin.petersen@oracle.com, linux-kernel@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	tytso@mit.edu, jbongio@google.com, ojaswin@linux.ibm.com
Subject: Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support
Date: Fri, 9 Feb 2024 12:47:38 +0000	[thread overview]
Message-ID: <20836bd6-7b17-4432-a2b9-085e27014384@oracle.com> (raw)
In-Reply-To: <ZcWCeU0n7zKEPHk5@dread.disaster.area>

>>
>> Playing devil's advocate here, at least this behavior should be documented.
> 
> That's what man pages are for, yes?
> 
> Are you expecting your deployments to be run on highly suboptimal
> configurations and so the code needs to be optimised for this
> behaviour, or are you expecting them to be run on correctly
> configured systems which would never see these issues?

The latter hopefully

> 
> 
>>> The whole reason for rtextsize existing is to optimise the rtextent
>>> allocation to the typical minimum IO size done to that volume. If
>>> all your IO is sub-rtextsize size and alignment, then all that has
>>> been done is forcing the entire rt device IO into a corner it was
>>> never really intended nor optimised for.
>>
>> Sure, but just because we are optimized for a certain IO write size should
>> not mean that other writes are disallowed or quite problematic.
> 
> Atomic writes are just "other writes". They are writes that are
> *expected to fail* if they cannot be done atomically.

Agreed

> 
> Application writers will quickly learn how to do sane, fast,
> reliable atomic write IO if we reject anything that is going to
> requires some complex, sub-optimal workaround in the kernel to make
> it work. The simplest solution is to -fail the write-, because
> userspace *must* be prepared for *any* atomic write to fail.

Sure, but it needs to be such that the application writer at least knows 
why it failed, which so far had not been documented.

> 
>>> Why should we jump through crazy hoops to try to make filesystems
>>> optimised for large IOs with mismatched, overlapping small atomic
>>> writes?
>>
>> As mentioned, typically the atomic writes will be the same size, but we may
>> have other writes of smaller size.
> 
> Then we need the tiny write to allocate and zero according to the
> maximum sized atomic write bounds. Then we just don't care about
> large atomic IO overlapping small IO, because the extent on disk
> aligned to the large atomic IO is then always guaranteed to be the
> correct size and shape.

I think it's worth mentioning that there is currently a separation 
between how we configure the FS extent size for atomic writes and what 
the bdev can actually support in terms of atomic writes.

Setting the rtvol extsize at mkfs time or enabling atomic writes 
FS_XFLAG_ATOMICWRITES doesn't check for what the underlying bdev can do 
in terms of atomic writes.

This check is not done as it is not fixed what the bdev can do in terms 
of atomic writes - or, more specifically, what they request_queue 
reports is not be fixed. There are things which can change this. For 
example, a FW update could change all the atomic write capabilities of a 
disk. Or even if we swapped a SCSI disk into another host the atomic 
write limits may change, as the atomic write unit max depends on the 
SCSI HBA DMA limits. Changing BIO_MAX_VECS - which could come from a 
kernel update - could also change what we report as atomic write limit 
in the request queue.

> 
> 
>>>> With the change in this patch, instead we have something like this after the
>>>> first write:
>>>>
>>>> # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file
>>>> wrote 4096 bytes at pos 0 write_size=4096
>>>> # filefrag -v mnt/file
>>>> Filesystem type is: 58465342
>>>> File size of mnt/file is 4096 (1 block of 4096 bytes)
>>>>     ext:     logical_offset:        physical_offset: length:   expected:
>>>> flags:
>>>>       0:        0..       3:         24..        27:      4:
>>>> last,eof
>>>> mnt/file: 1 extent found
>>>> #
>>>>
>>>> So the 16KB extent is in written state and the 2nd 16KB write would iter
>>>> once, producing a single BIO.
>>> Sure, I know how it works. My point is that it's a terrible way to
>>> go about allowing that second atomic write to succeed.
>> I think 'terrible' is a bit too strong a word here.
> 
> Doing it anything in a way that a user can DOS the entire filesystem
> is *terrible*. No ifs, buts or otherwise.

Understood

> 
>> Indeed, you suggest to
>> manually zero the file to solve this problem, below, while this code change
>> does the same thing automatically.
> 
> Yes, but I also outlined a way that it can be done automatically
> without being terrible. There are multiple options here, I outlined
> two different approaches that are acceptible.

I think that I need to check these alternate solutions in more detail. 
More below.

> 
>>>>>> In this
>>>>>> scenario, the iomap code will issue 2x IOs, which is unacceptable. So we
>>>>>> ensure that the extent is completely written whenever we allocate it. At
>>>>>> least that is my idea.
>>>>> So return an unaligned extent, and then the IOMAP_ATOMIC checks you
>>>>> add below say "no" and then the application has to do things the
>>>>> slow, safe way....
>>>> We have been porting atomic write support to some database apps and they
>>>> (database developers) have had to do something like manually zero the
>>>> complete file to get around this issue, but that's not a good user
>>>> experience.
>>> Better the application zeros the file when it is being initialised
>>> and doesn't have performance constraints rather than forcing the
>>> filesystem to do it in the IO fast path when IO performance and
>>> latency actually matters to the application.
>>
>> Can't we do both? I mean, the well-informed user can still pre-zero the file
>> just to ensure we aren't doing this zero'ing with the extent allocation.
> 
> I never said we can't do zeroing. I just said that it's normally
> better when the application controls zeroing directly.

ok

> 
>>> And therein lies the problem.
>>>
>>> If you are doing sub-rtextent IO at all, then you are forcing the
>>> filesystem down the path of explicitly using unwritten extents and
>>> requiring O_DSYNC direct IO to do journal flushes in IO completion
>>> context and then performance just goes down hill from them.
>>>
>>> The requirement for unwritten extents to track sub-rtextsize written
>>> regions is what you're trying to work around with XFS_BMAPI_ZERO so
>>> that atomic writes will always see "atomic write aligned" allocated
>>> regions.
>>>
>>> Do you see the problem here? You've explicitly told the filesystem
>>> that allocation is aligned to 64kB chunks, then because the
>>> filesystem block size is 4kB, it's allowed to track unwritten
>>> regions at 4kB boundaries. Then you do 4kB aligned file IO, which
>>> then changes unwritten extents at 4kB boundaries. Then you do a
>>> overlapping 16kB IO that*requires*  16kB allocation alignment, and
>>> things go BOOM.
>>>
>>> Yes, they should go BOOM.
>>>
>>> This is a horrible configuration - it is incomaptible with 16kB
>>> aligned and sized atomic IO.
>>
>> Just because the DB may do 16KB atomic writes most of the time should not
>> disallow it from any other form of writes.
> 
> That's not what I said. I said the using sub-rtextsize atomic writes
> with single FSB unwritten extent tracking is horrible and
> incompatible with doing 16kB atomic writes.
> 
> This setup will not work at all well with your patches and should go
> BOOM. Using XFS_BMAPI_ZERO is hacking around the fact that the setup
> has uncoordinated extent allocation and unwritten conversion
> granularity.
> 
> That's the fundamental design problem with your approach - it allows
> unwritten conversion at *minimum IO sizes* and that does not work
> with atomic IOs with larger alignment requirements.
> 
> The fundamental design principle is this: for maximally sized atomic
> writes to always succeed we require every allocation, zeroing and
> unwritten conversion operation to use alignments and sizes that are
> compatible with the maximum atomic write sizes being used.
> 

That sounds fine.

My question then is how we determine this max atomic write size granularity.

We don't explicitly tell the FS what atomic write size we want for a 
file. Rather we mkfs with some extsize value which should match our 
atomic write maximal value and then tell the FS we want to do atomic 
writes on a file, and if this is accepted then we can query the atomic 
write min and max unit size, and this would be [FS block size, min(bdev 
atomic write limit, rtexsize)].

If rtextsize is 16KB, then we have a good idea that we want 16KB atomic 
writes support. So then we could use rtextsize as this max atomic write 
size. But I am not 100% sure that it your idea (apologies if I am wrong 
- I am sincerely trying to follow your idea), but rather it would be 
min(rtextsize, bdev atomic write limit), e.g. if rtextsize was 1MB and 
bdev atomic write limit is 16KB, then there is no much point in dealing 
in 1MB blocks for this unwritten extent conversion alignment. If so, 
then my concern is that the bdev atomic write upper limit is not fixed. 
This can solved, but I would still like to be clear on this max atomic 
write size.

 > i.e. atomic writes need to use max write size granularity for all IO
 > operations, not filesystem block granularity.
> 
> And that also means things like rtextsize and extsize hints need to
> match these atomic write requirements, too....
> 

As above, I am not 100% sure if you mean these to be the atomic write 
maximal value.

>>> Allocation is aligned to 64kB, written
>>> region tracking is aligned to 4kB, and there's nothing to tell the
>>> filesystem that it should be maintaining 16kB "written alignment" so
>>> that 16kB atomic writes can always be issued atomically.

Please note that in my previous example the mkfs rtextsize arg should 
really have been 16KB, and that the intention would have been to enable 
16KB atomic writes. I used 64KB casually as I thought it should be 
possible to support sub-rtextsize atomic writes. The point which I was 
trying to make was that the 16KB atomic write and 4KB regular write 
intermixing was problematic.

>>>
>>> i.e. if we are going to do 16kB aligned atomic IO, then all the
>>> allocation and unwritten tracking needs to be done in 16kB aligned
>>> chunks, not 4kB. That means a 4KB write into an unwritten region or
>>> a hole actually needs to zero the rest of the 16KB range it sits
>>> within.
>>>
>>> The direct IO code can do this, but it needs extension of the
>>> unaligned IO serialisation in XFS (the alignment checks in
>>> xfs_file_dio_write()) and the the sub-block zeroing in
>>> iomap_dio_bio_iter() (the need_zeroing padding has to span the fs
>>> allocation size, not the fsblock size) to do this safely.
>>>
>>> Regardless of how we do it, all IO concurrency on this file is shot
>>> if we have sub-rtextent sized IOs being done. That is true even with
>>> this patch set - XFS_BMAPI_ZERO is done whilst holding the
>>> XFS_ILOCK_EXCL, and so no other DIO can map extents whilst the
>>> zeroing is being done.
>>>
>>> IOWs, anything to do with sub-rtextent IO really has to be treated
>>> like sub-fsblock DIO - i.e. exclusive inode access until the
>>> sub-rtextent zeroing has been completed.
>>
>> I do understand that this is not perfect that we may have mixed block sizes
>> being written, but I don't think that we should disallow it and throw an
>> error.
> 
> Ummmm, did you read what you quoted?
> 
> The above is an outline of the IO path modifications that will allow
> mixed IO sizes to be used with atomic writes without requiring the
> XFS_BMAPI_ZERO hack. It pushes the sub-atomic write alignment
> zeroing out to the existing DIO sub-block zeroing, hence ensuring
> that we only ever convert unwritten extents on max sized atomic
> write boundaries for atomic write enabled inodes.

ok, I get this idea. And, indeed, it does sound better than the 
XFS_BMAPI_ZERO proposal.

> 
> At no point have I said "no mixed writes".

For sure

> I've said no to the
> XFS_BMAPI_ZERO hack, but then I've explained the fundamental issue
> that it works around and given you a decent amount of detail on how
> to sanely implementing mixed write support that will work (slowly)
> with those configurations and IO patterns.
> 
> So it's your choice - you can continue to beleive I don't mixed
> writes to work at all, or you can go back and try to understand the
> IO path changes I've suggested that will allow mixed atomic writes
> to work as well as they possibly can....
> 

Ack

Much appreciated,
John