Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support

From: John Garry <john.g.garry@oracle.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: hch@lst.de, viro@zeniv.linux.org.uk, brauner@kernel.org,
	dchinner@redhat.com, jack@suse.cz, chandan.babu@oracle.com,
	martin.petersen@oracle.com, linux-kernel@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	tytso@mit.edu, jbongio@google.com, ojaswin@linux.ibm.com
Subject: Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support
Date: Mon, 5 Feb 2024 13:36:03 +0000
Message-ID: <e61cf382-66bd-4091-b49c-afbb5ce67d8f@oracle.com>
In-Reply-To: <20240202184758.GA6226@frogsfrogsfrogs>

On 02/02/2024 18:47, Darrick J. Wong wrote:
> On Wed, Jan 24, 2024 at 02:26:44PM +0000, John Garry wrote:
>> Ensure that when creating a mapping that we adhere to all the atomic
>> write rules.
>>
>> We check that the mapping covers the complete range of the write to ensure
>> that we'll be just creating a single mapping.
>>
>> Currently minimum granularity is the FS block size, but it should be
>> possible to support lower in future.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>> I am setting this as an RFC as I am not sure on the change in
>> xfs_iomap_write_direct() - it gives the desired result AFAICS.
>>
>>   fs/xfs/xfs_iomap.c | 41 +++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 41 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 18c8f168b153..758dc1c90a42 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -289,6 +289,9 @@ xfs_iomap_write_direct(
>>   		}
>>   	}
>>   
>> +	if (xfs_inode_atomicwrites(ip))
>> +		bmapi_flags = XFS_BMAPI_ZERO;
> 
> Why do we want to write zeroes to the disk if we're allocating space
> even if we're not sending an atomic write?
> 
> (This might want an explanation for why we're doing this at all -- it's
> to avoid unwritten extent conversion, which defeats hardware untorn
> writes.)

It's to handle the scenario where we have a partially written extent
and then issue an atomic write which covers the complete extent. In
this scenario, the iomap code will issue 2x IOs, which is unacceptable
for an atomic write. So we ensure that the extent is completely written
whenever we allocate it. At least that is my idea.
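
To make the scenario concrete, here is a rough userspace model (not
kernel code). It assumes that one IO is issued per mapping, that a
change between written and unwritten state starts a new mapping, and
that the extents are physically contiguous. The names model_extent and
model_count_ios are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an extent record; units are fs blocks. */
struct model_extent {
	uint64_t startoff;	/* file offset of the extent */
	uint64_t blockcount;	/* length of the extent */
	bool written;		/* false means unwritten (preallocated) */
};

/*
 * Count the IOs needed for a write to [off, off + len), assuming one IO
 * per mapping and a new mapping at every written/unwritten transition.
 * The extents must be sorted by startoff and physically contiguous.
 */
static unsigned int model_count_ios(const struct model_extent *ext,
				    size_t nr, uint64_t off, uint64_t len)
{
	uint64_t end = off + len;
	unsigned int ios = 0;
	bool have_prev = false;
	bool prev_written = false;
	size_t i;

	assert(ext != NULL && nr > 0);

	for (i = 0; i < nr; i++) {
		uint64_t ext_end = ext[i].startoff + ext[i].blockcount;

		/* Skip extents which do not overlap the write range. */
		if (ext_end <= off || ext[i].startoff >= end)
			continue;

		/* The first overlap or a state change starts a new IO. */
		if (!have_prev || ext[i].written != prev_written)
			ios++;

		have_prev = true;
		prev_written = ext[i].written;
	}

	return ios;
}
```

With a 4-block written extent followed by a 4-block unwritten extent,
an 8-block write from offset 0 is counted as 2 IOs in this model. If
both halves are already written, as they would be after zeroing at
allocation time, the same write is counted as 1 IO.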

> 
> I think we should support IOCB_ATOMIC when the mapping is unwritten --
> the data will land on disk in an untorn fashion, the unwritten extent
> conversion on IO completion is itself atomic, and callers still have to
> set O_DSYNC to persist anything. 

But does this work for the scenario above?

> Then we can avoid the cost of
> BMAPI_ZERO, because double-writes aren't free.

About double-writes not being free, I thought that it was acceptable to
write zeroes just once, when initially allocating the extent, as it
should not add too much overhead in practice, i.e. it's a one-off.

> 
>> +
>>   	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks,
>>   			rblocks, force, &tp);
>>   	if (error)
>> @@ -812,6 +815,44 @@ xfs_direct_write_iomap_begin(
>>   	if (error)
>>   		goto out_unlock;
>>   
>> +	if (flags & IOMAP_ATOMIC) {
>> +		xfs_filblks_t unit_min_fsb, unit_max_fsb;
>> +		unsigned int unit_min, unit_max;
>> +
>> +		xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
>> +		unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min);
>> +		unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max);
>> +
>> +		if (!imap_spans_range(&imap, offset_fsb, end_fsb)) {
>> +			error = -EINVAL;
>> +			goto out_unlock;
>> +		}
>> +
>> +		if ((offset & mp->m_blockmask) ||
>> +		    (length & mp->m_blockmask)) {
>> +			error = -EINVAL;
>> +			goto out_unlock;
>> +		}
>> +
>> +		if (imap.br_blockcount == unit_min_fsb ||
>> +		    imap.br_blockcount == unit_max_fsb) {
>> +			/* ok if exactly min or max */
>> +		} else if (imap.br_blockcount < unit_min_fsb ||
>> +			   imap.br_blockcount > unit_max_fsb) {
>> +			error = -EINVAL;
>> +			goto out_unlock;
>> +		} else if (!is_power_of_2(imap.br_blockcount)) {
>> +			error = -EINVAL;
>> +			goto out_unlock;
>> +		}
>> +
>> +		if (imap.br_startoff &&
>> +		    imap.br_startoff & (imap.br_blockcount - 1)) {
> 
> Not sure why we care about the file position, it's br_startblock that
> gets passed into the bio, not br_startoff.

We just want to ensure that the length of the write is valid w.r.t. the
offset within the extent, and br_startoff would be the offset within
the aligned extent.
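
As a sketch of what the checks in the hunk above amount to, here is a
standalone userspace version. It is only an approximation: the body of
the final br_startoff branch is not shown in the quote, so rejecting
the write there is an assumption, as is requiring unit_min and unit_max
to be powers of 2. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's is_power_of_2(). */
static bool model_is_power_of_2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Return true if a mapping of blockcount fs blocks starting at file
 * offset startoff (also in fs blocks) passes the length and alignment
 * rules: the length is a power of 2 within [unit_min_fsb, unit_max_fsb]
 * and the start offset is a multiple of the length.
 */
static bool model_atomic_write_ok(uint64_t startoff, uint64_t blockcount,
				  uint64_t unit_min_fsb,
				  uint64_t unit_max_fsb)
{
	/* Assumption: the limits themselves are powers of 2. */
	assert(model_is_power_of_2(unit_min_fsb));
	assert(model_is_power_of_2(unit_max_fsb));
	assert(unit_min_fsb <= unit_max_fsb);

	if (blockcount != unit_min_fsb && blockcount != unit_max_fsb) {
		/* Not exactly min or max, so check range and shape. */
		if (blockcount < unit_min_fsb || blockcount > unit_max_fsb)
			return false;
		if (!model_is_power_of_2(blockcount))
			return false;
	}

	/*
	 * blockcount is a power of 2 here, so the mask test checks that
	 * startoff is a multiple of blockcount. A startoff of 0 always
	 * passes, so no separate zero check is needed.
	 */
	if (startoff & (blockcount - 1))
		return false;

	return true;
}
```

For example, with unit_min_fsb = 1 and unit_max_fsb = 16, a 4-block
mapping at file offset 4 passes, while the same length at file offset 2
fails the alignment test and a 3-block mapping fails the power-of-2
test.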

> 
> I'm also still not convinced that any of this validation is useful here.
> The block device stack underneath the filesystem can change at any time
> without any particular notice to the fs, so the only way to find out if
> the proposed IO would meet the alignment constraints is to submit_bio
> and see what happens.

I am not sure what submit_bio() would do differently. If the block
device is changing underneath the block layer, then that is where these
things need to be checked.

> 
> (The "one bio per untorn write request" thing in the direct-io.c patch
> sounds sane to me though.)
> 

ok

Thanks,
John