* [PATCH] zonefs: Always invalidate last cache page on append write
@ 2023-03-29 5:58 Damien Le Moal
2023-03-29 6:14 ` Johannes Thumshirn
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Damien Le Moal @ 2023-03-29 5:58 UTC (permalink / raw)
To: linux-fsdevel; +Cc: Johannes Thumshirn, Hans Holmberg
When a direct append write is executed, the append offset may correspond
to the last page of an inode which might have been cached already by
buffered reads, page faults with mmap-read or non-direct readahead.
To ensure that the on-disk and cached data is consistant for such last
cached page, make sure to always invalidate it in
zonefs_file_dio_append(). This invalidation will always be a no-op when
the device block size is equal to the page size (e.g. 4K).
Reported-by: Hans Holmberg <Hans.Holmberg@wdc.com>
Fixes: 02ef12a663c7 ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
---
fs/zonefs/file.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index 617e4f9db42e..eeab8b93493b 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -390,6 +390,18 @@ static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from)
max = ALIGN_DOWN(max << SECTOR_SHIFT, inode->i_sb->s_blocksize);
iov_iter_truncate(from, max);
+ /*
+ * If the inode block size (sector size) is smaller than the
+ * page size, we may be appending data belonging to an already
+ * cached last page of the inode. So make sure to invalidate that
+ * last cached page. This will always be a no-op for the case where
+ * the block size is equal to the page size.
+ */
+ ret = invalidate_inode_pages2_range(inode->i_mapping,
+ iocb->ki_pos >> PAGE_SHIFT, -1);
+ if (ret)
+ return ret;
+
nr_pages = iov_iter_npages(from, BIO_MAX_VECS);
if (!nr_pages)
return 0;
--
2.39.2
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH] zonefs: Always invalidate last cache page on append write
2023-03-29 5:58 [PATCH] zonefs: Always invalidate last cache page on append write Damien Le Moal
@ 2023-03-29 6:14 ` Johannes Thumshirn
2023-03-29 8:14 ` Christoph Hellwig
2023-03-29 11:04 ` Hans Holmberg
2 siblings, 0 replies; 10+ messages in thread
From: Johannes Thumshirn @ 2023-03-29 6:14 UTC (permalink / raw)
To: Damien Le Moal, linux-fsdevel; +Cc: Hans Holmberg
Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] zonefs: Always invalidate last cache page on append write
2023-03-29 5:58 [PATCH] zonefs: Always invalidate last cache page on append write Damien Le Moal
2023-03-29 6:14 ` Johannes Thumshirn
@ 2023-03-29 8:14 ` Christoph Hellwig
2023-03-29 8:27 ` Damien Le Moal
2023-03-29 11:04 ` Hans Holmberg
2 siblings, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2023-03-29 8:14 UTC (permalink / raw)
To: Damien Le Moal; +Cc: linux-fsdevel, Johannes Thumshirn, Hans Holmberg
On Wed, Mar 29, 2023 at 02:58:23PM +0900, Damien Le Moal wrote:
> + /*
> + * If the inode block size (sector size) is smaller than the
> + * page size, we may be appending data belonging to an already
> + * cached last page of the inode. So make sure to invalidate that
> + * last cached page. This will always be a no-op for the case where
> + * the block size is equal to the page size.
> + */
> + ret = invalidate_inode_pages2_range(inode->i_mapping,
> + iocb->ki_pos >> PAGE_SHIFT, -1);
> + if (ret)
> + return ret;
The missing truncate here obviously is a bug and needs fixing.
But why does this not follow the logic in __iomap_dio_rw to to return
-ENOTBLK for any error so that the write falls back to buffered I/O.
Also as far as I can tell from reading the code, -1 is not a valid
end special case for invalidate_inode_pages2_range, so you'll actually
have to pass a valid end here.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] zonefs: Always invalidate last cache page on append write
2023-03-29 8:14 ` Christoph Hellwig
@ 2023-03-29 8:27 ` Damien Le Moal
2023-03-29 9:49 ` Damien Le Moal
2023-03-29 23:36 ` Christoph Hellwig
0 siblings, 2 replies; 10+ messages in thread
From: Damien Le Moal @ 2023-03-29 8:27 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, Johannes Thumshirn, Hans Holmberg
On 3/29/23 17:14, Christoph Hellwig wrote:
> On Wed, Mar 29, 2023 at 02:58:23PM +0900, Damien Le Moal wrote:
>> + /*
>> + * If the inode block size (sector size) is smaller than the
>> + * page size, we may be appending data belonging to an already
>> + * cached last page of the inode. So make sure to invalidate that
>> + * last cached page. This will always be a no-op for the case where
>> + * the block size is equal to the page size.
>> + */
>> + ret = invalidate_inode_pages2_range(inode->i_mapping,
>> + iocb->ki_pos >> PAGE_SHIFT, -1);
>> + if (ret)
>> + return ret;
>
> The missing truncate here obviously is a bug and needs fixing.
>
> But why does this not follow the logic in __iomap_dio_rw to to return
> -ENOTBLK for any error so that the write falls back to buffered I/O.
This is a write to sequential zones so we cannot use buffered writes. We have to
do a direct write to ensure ordering between writes.
Note that this is the special blocking write case where we issue a zone append.
For async regular writes, we use iomap so this bug does not exist. But then I
now realize that __iomap_dio_rw() falling back to buffered IOs could also create
an issue with write ordering.
> Also as far as I can tell from reading the code, -1 is not a valid
> end special case for invalidate_inode_pages2_range, so you'll actually
> have to pass a valid end here.
I wondered about that but then saw:
int invalidate_inode_pages2(struct address_space *mapping)
{
return invalidate_inode_pages2_range(mapping, 0, -1);
}
EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
which tend to indicate that "-1" is fine. The end is passed to
find_get_entries() -> find_get_entry() where it becomes a "max" pgoff_t, so
using -1 seems fine.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] zonefs: Always invalidate last cache page on append write
2023-03-29 8:27 ` Damien Le Moal
@ 2023-03-29 9:49 ` Damien Le Moal
2023-03-29 23:36 ` Christoph Hellwig
1 sibling, 0 replies; 10+ messages in thread
From: Damien Le Moal @ 2023-03-29 9:49 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, Johannes Thumshirn, Hans Holmberg
On 3/29/23 17:27, Damien Le Moal wrote:
> On 3/29/23 17:14, Christoph Hellwig wrote:
>> On Wed, Mar 29, 2023 at 02:58:23PM +0900, Damien Le Moal wrote:
>>> + /*
>>> + * If the inode block size (sector size) is smaller than the
>>> + * page size, we may be appending data belonging to an already
>>> + * cached last page of the inode. So make sure to invalidate that
>>> + * last cached page. This will always be a no-op for the case where
>>> + * the block size is equal to the page size.
>>> + */
>>> + ret = invalidate_inode_pages2_range(inode->i_mapping,
>>> + iocb->ki_pos >> PAGE_SHIFT, -1);
>>> + if (ret)
>>> + return ret;
>>
>> The missing truncate here obviously is a bug and needs fixing.
>>
>> But why does this not follow the logic in __iomap_dio_rw to to return
>> -ENOTBLK for any error so that the write falls back to buffered I/O.
>
> This is a write to sequential zones so we cannot use buffered writes. We have to
> do a direct write to ensure ordering between writes.
>
> Note that this is the special blocking write case where we issue a zone append.
> For async regular writes, we use iomap so this bug does not exist. But then I
> now realize that __iomap_dio_rw() falling back to buffered IOs could also create
> an issue with write ordering.
Checking this, there are no issues as it is the FS caller of iomap_dio_rw() who
has to fallback to buffered IO if it wants to. But zonefs does not do that.
>
>> Also as far as I can tell from reading the code, -1 is not a valid
>> end special case for invalidate_inode_pages2_range, so you'll actually
>> have to pass a valid end here.
>
> I wondered about that but then saw:
>
> int invalidate_inode_pages2(struct address_space *mapping)
> {
> return invalidate_inode_pages2_range(mapping, 0, -1);
> }
> EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
>
> which tend to indicate that "-1" is fine. The end is passed to
> find_get_entries() -> find_get_entry() where it becomes a "max" pgoff_t, so
> using -1 seems fine.
>
>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] zonefs: Always invalidate last cache page on append write
2023-03-29 8:27 ` Damien Le Moal
2023-03-29 9:49 ` Damien Le Moal
@ 2023-03-29 23:36 ` Christoph Hellwig
2023-03-29 23:57 ` Damien Le Moal
1 sibling, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2023-03-29 23:36 UTC (permalink / raw)
To: Damien Le Moal
Cc: Christoph Hellwig, linux-fsdevel, Johannes Thumshirn, Hans Holmberg
On Wed, Mar 29, 2023 at 05:27:43PM +0900, Damien Le Moal wrote:
> > But why does this not follow the logic in __iomap_dio_rw to to return
> > -ENOTBLK for any error so that the write falls back to buffered I/O.
>
> This is a write to sequential zones so we cannot use buffered writes. We have to
> do a direct write to ensure ordering between writes.
>
> Note that this is the special blocking write case where we issue a zone append.
> For async regular writes, we use iomap so this bug does not exist. But then I
> now realize that __iomap_dio_rw() falling back to buffered IOs could also create
> an issue with write ordering.
Can we add a comment please on why this is different? And maybe bundle
the iomap-using path fix into the series while you're at it.
> > Also as far as I can tell from reading the code, -1 is not a valid
> > end special case for invalidate_inode_pages2_range, so you'll actually
> > have to pass a valid end here.
>
> I wondered about that but then saw:
>
> int invalidate_inode_pages2(struct address_space *mapping)
> {
> return invalidate_inode_pages2_range(mapping, 0, -1);
> }
> EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
>
> which tend to indicate that "-1" is fine. The end is passed to
> find_get_entries() -> find_get_entry() where it becomes a "max" pgoff_t, so
> using -1 seems fine.
Oh, indeed. There's a little magic involved. Still, any reason not to
pass the real end like iomap?
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] zonefs: Always invalidate last cache page on append write
2023-03-29 23:36 ` Christoph Hellwig
@ 2023-03-29 23:57 ` Damien Le Moal
2023-03-30 0:07 ` Christoph Hellwig
0 siblings, 1 reply; 10+ messages in thread
From: Damien Le Moal @ 2023-03-29 23:57 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, Johannes Thumshirn, Hans Holmberg
On 3/30/23 08:36, Christoph Hellwig wrote:
> On Wed, Mar 29, 2023 at 05:27:43PM +0900, Damien Le Moal wrote:
>>> But why does this not follow the logic in __iomap_dio_rw to to return
>>> -ENOTBLK for any error so that the write falls back to buffered I/O.
>>
>> This is a write to sequential zones so we cannot use buffered writes. We have to
>> do a direct write to ensure ordering between writes.
>>
>> Note that this is the special blocking write case where we issue a zone append.
>> For async regular writes, we use iomap so this bug does not exist. But then I
>> now realize that __iomap_dio_rw() falling back to buffered IOs could also create
>> an issue with write ordering.
>
> Can we add a comment please on why this is different? And maybe bundle
> the iomap-using path fix into the series while you're at it.
Not sure what you mean here. "iomap-using path fix" ?
Do you mean adding a comment about the fact that zonefs does not fallback to
doing buffered writes if the iomap_dio_rw() or zonefs dio append direct write fail ?
>
>>> Also as far as I can tell from reading the code, -1 is not a valid
>>> end special case for invalidate_inode_pages2_range, so you'll actually
>>> have to pass a valid end here.
>>
>> I wondered about that but then saw:
>>
>> int invalidate_inode_pages2(struct address_space *mapping)
>> {
>> return invalidate_inode_pages2_range(mapping, 0, -1);
>> }
>> EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
>>
>> which tend to indicate that "-1" is fine. The end is passed to
>> find_get_entries() -> find_get_entry() where it becomes a "max" pgoff_t, so
>> using -1 seems fine.
>
> Oh, indeed. There's a little magic involved. Still, any reason not to
> pass the real end like iomap?
Simplicity: we write append only and so we know that the only cached page we can
eventually hit is the one straddling inode->i_size. So invalidating everything
from that page is safe, and simple.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] zonefs: Always invalidate last cache page on append write
2023-03-29 23:57 ` Damien Le Moal
@ 2023-03-30 0:07 ` Christoph Hellwig
2023-03-30 0:22 ` Damien Le Moal
0 siblings, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2023-03-30 0:07 UTC (permalink / raw)
To: Damien Le Moal
Cc: Christoph Hellwig, linux-fsdevel, Johannes Thumshirn, Hans Holmberg
On Thu, Mar 30, 2023 at 08:57:56AM +0900, Damien Le Moal wrote:
> Not sure what you mean here. "iomap-using path fix" ?
> Do you mean adding a comment about the fact that zonefs does not fallback to
> doing buffered writes if the iomap_dio_rw() or zonefs dio append direct write fail ?
Making sure that the odd ENOTBLK error code does not leak to userspace
for this case on zonefs.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] zonefs: Always invalidate last cache page on append write
2023-03-30 0:07 ` Christoph Hellwig
@ 2023-03-30 0:22 ` Damien Le Moal
0 siblings, 0 replies; 10+ messages in thread
From: Damien Le Moal @ 2023-03-30 0:22 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, Johannes Thumshirn, Hans Holmberg
On 3/30/23 09:07, Christoph Hellwig wrote:
> On Thu, Mar 30, 2023 at 08:57:56AM +0900, Damien Le Moal wrote:
>> Not sure what you mean here. "iomap-using path fix" ?
>> Do you mean adding a comment about the fact that zonefs does not fallback to
>> doing buffered writes if the iomap_dio_rw() or zonefs dio append direct write fail ?
>
> Making sure that the odd ENOTBLK error code does not leak to userspace
> for this case on zonefs.
OK. Let me check that.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] zonefs: Always invalidate last cache page on append write
2023-03-29 5:58 [PATCH] zonefs: Always invalidate last cache page on append write Damien Le Moal
2023-03-29 6:14 ` Johannes Thumshirn
2023-03-29 8:14 ` Christoph Hellwig
@ 2023-03-29 11:04 ` Hans Holmberg
2 siblings, 0 replies; 10+ messages in thread
From: Hans Holmberg @ 2023-03-29 11:04 UTC (permalink / raw)
To: Damien Le Moal; +Cc: linux-fsdevel, Johannes Thumshirn, Hans Holmberg
Applied the patch on top of 6.3.0-rc4 and it solves the corruption issue I saw,
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Cheers!
On Wed, Mar 29, 2023 at 7:58 AM Damien Le Moal
<damien.lemoal@opensource.wdc.com> wrote:
>
> When a direct append write is executed, the append offset may correspond
> to the last page of an inode which might have been cached already by
> buffered reads, page faults with mmap-read or non-direct readahead.
> To ensure that the on-disk and cached data is consistant for such last
> cached page, make sure to always invalidate it in
> zonefs_file_dio_append(). This invalidation will always be a no-op when
> the device block size is equal to the page size (e.g. 4K).
>
> Reported-by: Hans Holmberg <Hans.Holmberg@wdc.com>
> Fixes: 02ef12a663c7 ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO")
> Cc: stable@vger.kernel.org
> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> ---
> fs/zonefs/file.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
> index 617e4f9db42e..eeab8b93493b 100644
> --- a/fs/zonefs/file.c
> +++ b/fs/zonefs/file.c
> @@ -390,6 +390,18 @@ static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from)
> max = ALIGN_DOWN(max << SECTOR_SHIFT, inode->i_sb->s_blocksize);
> iov_iter_truncate(from, max);
>
> + /*
> + * If the inode block size (sector size) is smaller than the
> + * page size, we may be appending data belonging to an already
> + * cached last page of the inode. So make sure to invalidate that
> + * last cached page. This will always be a no-op for the case where
> + * the block size is equal to the page size.
> + */
> + ret = invalidate_inode_pages2_range(inode->i_mapping,
> + iocb->ki_pos >> PAGE_SHIFT, -1);
> + if (ret)
> + return ret;
> +
> nr_pages = iov_iter_npages(from, BIO_MAX_VECS);
> if (!nr_pages)
> return 0;
> --
> 2.39.2
>
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2023-03-30 0:22 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-29 5:58 [PATCH] zonefs: Always invalidate last cache page on append write Damien Le Moal
2023-03-29 6:14 ` Johannes Thumshirn
2023-03-29 8:14 ` Christoph Hellwig
2023-03-29 8:27 ` Damien Le Moal
2023-03-29 9:49 ` Damien Le Moal
2023-03-29 23:36 ` Christoph Hellwig
2023-03-29 23:57 ` Damien Le Moal
2023-03-30 0:07 ` Christoph Hellwig
2023-03-30 0:22 ` Damien Le Moal
2023-03-29 11:04 ` Hans Holmberg
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.