Re: [PATCH -next] ext4: ensure data forced to disk when rename

From: "Theodore Ts'o" <tytso@mit.edu>
To: Zhang Yi <yi.zhang@huawei.com>
Cc: Jan Kara <jack@suse.cz>, Ye Bin <yebin10@huawei.com>,
	adilger.kernel@dilger.ca, linux-ext4@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH -next] ext4: ensure data forced to disk when rename
Date: Thu, 1 Sep 2022 10:45:15 -0400	[thread overview]
Message-ID: <YxDFewyU6orvSfZX@mit.edu> (raw)
In-Reply-To: <ce8666e2-c224-79f9-7770-f87b31d122e3@huawei.com>

On Thu, Sep 01, 2022 at 09:19:04PM +0800, Zhang Yi wrote:
> 
> Our case is modifing a no-empty file names "data.conf" through vim, it will
> cp "data.conf" to ".data.conf.swp" firstly, overwrite rename ".data.conf.swp"
> back to "data.conf" and fsync after modifying. If we power down between rename
> and fsync, we may lose all data in the original "data.conf" and get a whole
> zero file. Before we enable dioread_nolock by defalut, the newly appending data
> may lost, but at least we will not lose the original data in the default
> data=ordered and auto_da_alloc mode. It seems that dioread_nolock breaks the
> guarantee provide by auto_da_alloc. Maybe we should add a fsync before rename
> in vim ?

What I normally recommend is to write the new contents of "data.conf"
to "data.conf.new".  Then fsync the file descriptor corresponding to
data.conf.new.  If you want to backup data.conf, you can create a hard
link of data.conf to data.conf.bak, then rename data.conf.new on top
of data.conf.

The auto_da_alloc mode is a best effort attempt to protect against bad
application programs, and it never was a guarantee that it would
*always* protect against data loss if you had a overwriting rename
racing against a power failure.  Without auto_da_alloc mode, the
window where the user might lose data if they application failed to do
the fsync() was 30 seconds.  With auto_da_alloc (and if dioread_nolock
was disabled), that window was reduced to say, a few hundred
milliseconds.  With dioread_nolock, that window was increased to
somehwere between 0 and 5 seconds (on average, 2.5 seconds) plus the
time for the write to complete (a few hundred milliseconds).

But note that your proposed patch doesn't actually really improve
things all that much.  That's because calling filemap_datawrite() only
waits for the write request to complete.  But you still need to update
the metadata before the blocks become visible after a crash.  That
requires forcing a journal commit, and waiting for that commit to
complete, which is what ext4_sync_file() does, and which is of course,
quite expensive.

We could add something to the end of ext4_convert_unwritten_io_end_vec()
where if the inode is da_alloc eligible, we trigger an asynchronous
journal commit; that is, once we converted all of the unwritten extents,
if the file has been closed after being opened with O_TRUNC, or the file has
been renamed on top of a pre-existing file and there were dirty blocks
that were flushed out when we called ext4_alloc_da_blocks(), that
ext4_convert_unwritten_io_end_vec() would then request that a journal
commit be started.   But we'd want to see what sort of performance regression
this adds, since nothing is for free, and the goal here is to paper over
buggy applications without imposing too much of a performance cost.

But it should be clear that this is *buggy* applications, and
applications SHOULD be calling fsync() before an overwriting rename()
if they care about data not being lost if there is a power failure at
an inconvenient point.   All we are trying to do here is to reduce the
probability of data loss caused by buggy applications.  There was never
any guarantees.

Cheers,

						- Ted