linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] f2fs: disble physical prealloc in LSF mount
@ 2019-11-22  8:59 Javier González
  2019-11-25  0:48 ` Damien Le Moal
  0 siblings, 1 reply; 8+ messages in thread
From: Javier González @ 2019-11-22  8:59 UTC (permalink / raw)
  To: jaegeuk, yuchao0
  Cc: linux-f2fs-devel, linux-kernel, damien.lemoal, Javier González

From: Javier González <javier.gonz@samsung.com>

Fix file system corruption when using LFS mount (e.g., in zoned
devices). Seems like the fallback into buffered I/O creates an
inconsistency if the application is assuming both read and write DIO. I
can easily reproduce a corruption with a simple RocksDB test.

Might be that the f2fs_forced_buffered_io path brings some problems too,
but I have not seen other failures besides this one.

Problem reproducible without a zoned block device, simply by forcing
LFS mount:

  $ sudo mkfs.f2fs -f -m /dev/nvme0n1
  $ sudo mount /dev/nvme0n1 /mnt/f2fs
  $ sudo  /opt/rocksdb/db_bench  --benchmarks=fillseq --use_existing_db=0
  --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
  --db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1
  --block_size=65536

Note that the options that cause the problem are:
  --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true

Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write")

Signed-off-by: Javier González <javier.gonz@samsung.com>
---
 fs/f2fs/data.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 5755e897a5f0..b045dd6ab632 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
 			return err;
 	}
 
-	if (direct_io && allow_outplace_dio(inode, iocb, from))
-		return 0;
-
 	if (is_inode_flag_set(inode, FI_NO_PREALLOC))
 		return 0;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] f2fs: disble physical prealloc in LSF mount
  2019-11-22  8:59 [PATCH] f2fs: disble physical prealloc in LSF mount Javier González
@ 2019-11-25  0:48 ` Damien Le Moal
  2019-11-25 19:03   ` Javier González
  0 siblings, 1 reply; 8+ messages in thread
From: Damien Le Moal @ 2019-11-25  0:48 UTC (permalink / raw)
  To: Javier González, jaegeuk, yuchao0
  Cc: linux-f2fs-devel, linux-kernel, Javier González

On 2019/11/22 18:00, Javier González wrote:
> From: Javier González <javier.gonz@samsung.com>
> 
> Fix file system corruption when using LFS mount (e.g., in zoned
> devices). Seems like the fallback into buffered I/O creates an
> inconsistency if the application is assuming both read and write DIO. I
> can easily reproduce a corruption with a simple RocksDB test.
> 
> Might be that the f2fs_forced_buffered_io path brings some problems too,
> but I have not seen other failures besides this one.
> 
> Problem reproducible without a zoned block device, simply by forcing
> LFS mount:
> 
>   $ sudo mkfs.f2fs -f -m /dev/nvme0n1
>   $ sudo mount /dev/nvme0n1 /mnt/f2fs
>   $ sudo  /opt/rocksdb/db_bench  --benchmarks=fillseq --use_existing_db=0
>   --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>   --db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1
>   --block_size=65536
> 
> Note that the options that cause the problem are:
>   --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
> 
> Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write")
> 
> Signed-off-by: Javier González <javier.gonz@samsung.com>
> ---
>  fs/f2fs/data.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> index 5755e897a5f0..b045dd6ab632 100644
> --- a/fs/f2fs/data.c
> +++ b/fs/f2fs/data.c
> @@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
>  			return err;
>  	}
>  
> -	if (direct_io && allow_outplace_dio(inode, iocb, from))
> -		return 0;

Since for LFS mode, all DIOs can end up out of place, I think that it
may be better to change allow_outplace_dio() to always return true in
the case of LFS mode. So may be something like:

static inline int allow_outplace_dio(struct inode *inode,
			struct kiocb *iocb, struct iov_iter *iter)
{
	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
	int rw = iov_iter_rw(iter);

	return test_opt(sbi, LFS) ||
	 	(rw == WRITE && !block_unaligned_IO(inode, iocb, iter));
}

instead of the original:

static inline int allow_outplace_dio(struct inode *inode,
			struct kiocb *iocb, struct iov_iter *iter)
{
	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
	int rw = iov_iter_rw(iter);

	return (test_opt(sbi, LFS) && (rw == WRITE) &&
				!block_unaligned_IO(inode, iocb, iter));
}

Thoughts ?

> -
>  	if (is_inode_flag_set(inode, FI_NO_PREALLOC))
>  		return 0;
>  
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] f2fs: disble physical prealloc in LSF mount
  2019-11-25  0:48 ` Damien Le Moal
@ 2019-11-25 19:03   ` Javier González
  2019-11-26  2:06     ` Damien Le Moal
  0 siblings, 1 reply; 8+ messages in thread
From: Javier González @ 2019-11-25 19:03 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: jaegeuk, yuchao0, linux-f2fs-devel, linux-kernel, Javier González

On 25.11.2019 00:48, Damien Le Moal wrote:
>On 2019/11/22 18:00, Javier González wrote:
>> From: Javier González <javier.gonz@samsung.com>
>>
>> Fix file system corruption when using LFS mount (e.g., in zoned
>> devices). Seems like the fallback into buffered I/O creates an
>> inconsistency if the application is assuming both read and write DIO. I
>> can easily reproduce a corruption with a simple RocksDB test.
>>
>> Might be that the f2fs_forced_buffered_io path brings some problems too,
>> but I have not seen other failures besides this one.
>>
>> Problem reproducible without a zoned block device, simply by forcing
>> LFS mount:
>>
>>   $ sudo mkfs.f2fs -f -m /dev/nvme0n1
>>   $ sudo mount /dev/nvme0n1 /mnt/f2fs
>>   $ sudo  /opt/rocksdb/db_bench  --benchmarks=fillseq --use_existing_db=0
>>   --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>>   --db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1
>>   --block_size=65536
>>
>> Note that the options that cause the problem are:
>>   --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>>
>> Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write")
>>
>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>> ---
>>  fs/f2fs/data.c | 3 ---
>>  1 file changed, 3 deletions(-)
>>
>> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
>> index 5755e897a5f0..b045dd6ab632 100644
>> --- a/fs/f2fs/data.c
>> +++ b/fs/f2fs/data.c
>> @@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
>>  			return err;
>>  	}
>>
>> -	if (direct_io && allow_outplace_dio(inode, iocb, from))
>> -		return 0;
>
>Since for LFS mode, all DIOs can end up out of place, I think that it
>may be better to change allow_outplace_dio() to always return true in
>the case of LFS mode. So may be something like:
>
>static inline int allow_outplace_dio(struct inode *inode,
>			struct kiocb *iocb, struct iov_iter *iter)
>{
>	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>	int rw = iov_iter_rw(iter);
>
>	return test_opt(sbi, LFS) ||
>	 	(rw == WRITE && !block_unaligned_IO(inode, iocb, iter));
>}
>
>instead of the original:
>
>static inline int allow_outplace_dio(struct inode *inode,
>			struct kiocb *iocb, struct iov_iter *iter)
>{
>	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>	int rw = iov_iter_rw(iter);
>
>	return (test_opt(sbi, LFS) && (rw == WRITE) &&
>				!block_unaligned_IO(inode, iocb, iter));
>}
>
>Thoughts ?
>

I see what you mean and it makes sense. However, the problem I am seeing
occurs when allow_outplace_dio() returns true, as this is what creates
the inconsistency between the write being buffered and the read being
DIO.

I did test the code you propose and, as expected, it still triggered the
corruption.

>> -
>>  	if (is_inode_flag_set(inode, FI_NO_PREALLOC))
>>  		return 0;
>>
>>
>
>
>-- 
>Damien Le Moal
>Western Digital Research

Thanks,
Javier

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] f2fs: disble physical prealloc in LSF mount
  2019-11-25 19:03   ` Javier González
@ 2019-11-26  2:06     ` Damien Le Moal
  2019-11-26  3:57       ` Javier González
  0 siblings, 1 reply; 8+ messages in thread
From: Damien Le Moal @ 2019-11-26  2:06 UTC (permalink / raw)
  To: Javier González
  Cc: jaegeuk, yuchao0, linux-f2fs-devel, linux-kernel, Javier González

On 2019/11/26 4:03, Javier González wrote:
> On 25.11.2019 00:48, Damien Le Moal wrote:
>> On 2019/11/22 18:00, Javier González wrote:
>>> From: Javier González <javier.gonz@samsung.com>
>>>
>>> Fix file system corruption when using LFS mount (e.g., in zoned
>>> devices). Seems like the fallback into buffered I/O creates an
>>> inconsistency if the application is assuming both read and write DIO. I
>>> can easily reproduce a corruption with a simple RocksDB test.
>>>
>>> Might be that the f2fs_forced_buffered_io path brings some problems too,
>>> but I have not seen other failures besides this one.
>>>
>>> Problem reproducible without a zoned block device, simply by forcing
>>> LFS mount:
>>>
>>>   $ sudo mkfs.f2fs -f -m /dev/nvme0n1
>>>   $ sudo mount /dev/nvme0n1 /mnt/f2fs
>>>   $ sudo  /opt/rocksdb/db_bench  --benchmarks=fillseq --use_existing_db=0
>>>   --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>>>   --db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1
>>>   --block_size=65536
>>>
>>> Note that the options that cause the problem are:
>>>   --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>>>
>>> Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write")
>>>
>>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>>> ---
>>>  fs/f2fs/data.c | 3 ---
>>>  1 file changed, 3 deletions(-)
>>>
>>> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
>>> index 5755e897a5f0..b045dd6ab632 100644
>>> --- a/fs/f2fs/data.c
>>> +++ b/fs/f2fs/data.c
>>> @@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
>>>  			return err;
>>>  	}
>>>
>>> -	if (direct_io && allow_outplace_dio(inode, iocb, from))
>>> -		return 0;
>>
>> Since for LFS mode, all DIOs can end up out of place, I think that it
>> may be better to change allow_outplace_dio() to always return true in
>> the case of LFS mode. So may be something like:
>>
>> static inline int allow_outplace_dio(struct inode *inode,
>> 			struct kiocb *iocb, struct iov_iter *iter)
>> {
>> 	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>> 	int rw = iov_iter_rw(iter);
>>
>> 	return test_opt(sbi, LFS) ||
>> 	 	(rw == WRITE && !block_unaligned_IO(inode, iocb, iter));
>> }
>>
>> instead of the original:
>>
>> static inline int allow_outplace_dio(struct inode *inode,
>> 			struct kiocb *iocb, struct iov_iter *iter)
>> {
>> 	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>> 	int rw = iov_iter_rw(iter);
>>
>> 	return (test_opt(sbi, LFS) && (rw == WRITE) &&
>> 				!block_unaligned_IO(inode, iocb, iter));
>> }
>>
>> Thoughts ?
>>
> 
> I see what you mean and it makes sense. However, the problem I am seeing
> occurs when allow_outplace_dio() returns true, as this is what creates
> the inconsistency between the write being buffered and the read being
> DIO.

But if the write is switched to buffered, the DIO read should use the
buffered path too, no ? Since this is all happening under VFS, the
generic DIO read path will not ensure that the buffered writes are
flushed to disk before issuing the direct read, I think. So that would
explain your data corruption, i.e. you are reading stale data on the
device before the buffered writes make it to the media.

> 
> I did test the code you propose and, as expected, it still triggered the
> corruption.
> 
>>> -
>>>  	if (is_inode_flag_set(inode, FI_NO_PREALLOC))
>>>  		return 0;
>>>
>>>
>>
>>
>> -- 
>> Damien Le Moal
>> Western Digital Research
> 
> Thanks,
> Javier
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] f2fs: disble physical prealloc in LSF mount
  2019-11-26  2:06     ` Damien Le Moal
@ 2019-11-26  3:57       ` Javier González
  2019-11-26  6:19         ` Damien Le Moal
  0 siblings, 1 reply; 8+ messages in thread
From: Javier González @ 2019-11-26  3:57 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: jaegeuk, yuchao0, linux-f2fs-devel, linux-kernel, Javier González

On 26.11.2019 02:06, Damien Le Moal wrote:
>On 2019/11/26 4:03, Javier González wrote:
>> On 25.11.2019 00:48, Damien Le Moal wrote:
>>> On 2019/11/22 18:00, Javier González wrote:
>>>> From: Javier González <javier.gonz@samsung.com>
>>>>
>>>> Fix file system corruption when using LFS mount (e.g., in zoned
>>>> devices). Seems like the fallback into buffered I/O creates an
>>>> inconsistency if the application is assuming both read and write DIO. I
>>>> can easily reproduce a corruption with a simple RocksDB test.
>>>>
>>>> Might be that the f2fs_forced_buffered_io path brings some problems too,
>>>> but I have not seen other failures besides this one.
>>>>
>>>> Problem reproducible without a zoned block device, simply by forcing
>>>> LFS mount:
>>>>
>>>>   $ sudo mkfs.f2fs -f -m /dev/nvme0n1
>>>>   $ sudo mount /dev/nvme0n1 /mnt/f2fs
>>>>   $ sudo  /opt/rocksdb/db_bench  --benchmarks=fillseq --use_existing_db=0
>>>>   --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>>>>   --db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1
>>>>   --block_size=65536
>>>>
>>>> Note that the options that cause the problem are:
>>>>   --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>>>>
>>>> Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write")
>>>>
>>>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>>>> ---
>>>>  fs/f2fs/data.c | 3 ---
>>>>  1 file changed, 3 deletions(-)
>>>>
>>>> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
>>>> index 5755e897a5f0..b045dd6ab632 100644
>>>> --- a/fs/f2fs/data.c
>>>> +++ b/fs/f2fs/data.c
>>>> @@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
>>>>  			return err;
>>>>  	}
>>>>
>>>> -	if (direct_io && allow_outplace_dio(inode, iocb, from))
>>>> -		return 0;
>>>
>>> Since for LFS mode, all DIOs can end up out of place, I think that it
>>> may be better to change allow_outplace_dio() to always return true in
>>> the case of LFS mode. So may be something like:
>>>
>>> static inline int allow_outplace_dio(struct inode *inode,
>>> 			struct kiocb *iocb, struct iov_iter *iter)
>>> {
>>> 	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>>> 	int rw = iov_iter_rw(iter);
>>>
>>> 	return test_opt(sbi, LFS) ||
>>> 	 	(rw == WRITE && !block_unaligned_IO(inode, iocb, iter));
>>> }
>>>
>>> instead of the original:
>>>
>>> static inline int allow_outplace_dio(struct inode *inode,
>>> 			struct kiocb *iocb, struct iov_iter *iter)
>>> {
>>> 	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>>> 	int rw = iov_iter_rw(iter);
>>>
>>> 	return (test_opt(sbi, LFS) && (rw == WRITE) &&
>>> 				!block_unaligned_IO(inode, iocb, iter));
>>> }
>>>
>>> Thoughts ?
>>>
>>
>> I see what you mean and it makes sense. However, the problem I am seeing
>> occurs when allow_outplace_dio() returns true, as this is what creates
>> the inconsistency between the write being buffered and the read being
>> DIO.
>
>But if the write is switched to buffered, the DIO read should use the
>buffered path too, no ? Since this is all happening under VFS, the
>generic DIO read path will not ensure that the buffered writes are
>flushed to disk before issuing the direct read, I think. So that would
>explain your data corruption, i.e. you are reading stale data on the
>device before the buffered writes make it to the media.
>

As far as I can see, the read is always sent DIO, so yes, I also believe
that we are reading stale data. This is why the corruption is not seen
if preventing allow_outplace_dio() from sending the write to the
buffered path.

What surprises me is that this is very easy to trigger (see commit), so
I assume you must have seen this with SMR in the past.

Does it make sense to leave the LFS check out of the
allow_outplace_dio()? Or in other words, is there a hard requirement for
writes to take this path on a zoned device that I am not seeing?
Something like:

   static inline int allow_outplace_dio(struct inode *inode,
   			struct kiocb *iocb, struct iov_iter *iter)
   {
   	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
   	int rw = iov_iter_rw(iter);

   	return (rw == WRITE && !block_unaligned_IO(inode, iocb, iter));
   }

Thanks,
Javier

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] f2fs: disble physical prealloc in LSF mount
  2019-11-26  3:57       ` Javier González
@ 2019-11-26  6:19         ` Damien Le Moal
  2019-11-26  6:20           ` Damien Le Moal
  0 siblings, 1 reply; 8+ messages in thread
From: Damien Le Moal @ 2019-11-26  6:19 UTC (permalink / raw)
  To: Javier González
  Cc: jaegeuk, yuchao0, linux-f2fs-devel, linux-kernel, Javier González

On 2019/11/26 12:58, Javier González wrote:
> On 26.11.2019 02:06, Damien Le Moal wrote:
>> On 2019/11/26 4:03, Javier González wrote:
>>> On 25.11.2019 00:48, Damien Le Moal wrote:
>>>> On 2019/11/22 18:00, Javier González wrote:
>>>>> From: Javier González <javier.gonz@samsung.com>
>>>>>
>>>>> Fix file system corruption when using LFS mount (e.g., in zoned
>>>>> devices). Seems like the fallback into buffered I/O creates an
>>>>> inconsistency if the application is assuming both read and write DIO. I
>>>>> can easily reproduce a corruption with a simple RocksDB test.
>>>>>
>>>>> Might be that the f2fs_forced_buffered_io path brings some problems too,
>>>>> but I have not seen other failures besides this one.
>>>>>
>>>>> Problem reproducible without a zoned block device, simply by forcing
>>>>> LFS mount:
>>>>>
>>>>>   $ sudo mkfs.f2fs -f -m /dev/nvme0n1
>>>>>   $ sudo mount /dev/nvme0n1 /mnt/f2fs
>>>>>   $ sudo  /opt/rocksdb/db_bench  --benchmarks=fillseq --use_existing_db=0
>>>>>   --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>>>>>   --db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1
>>>>>   --block_size=65536
>>>>>
>>>>> Note that the options that cause the problem are:
>>>>>   --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>>>>>
>>>>> Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write")
>>>>>
>>>>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>>>>> ---
>>>>>  fs/f2fs/data.c | 3 ---
>>>>>  1 file changed, 3 deletions(-)
>>>>>
>>>>> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
>>>>> index 5755e897a5f0..b045dd6ab632 100644
>>>>> --- a/fs/f2fs/data.c
>>>>> +++ b/fs/f2fs/data.c
>>>>> @@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
>>>>>  			return err;
>>>>>  	}
>>>>>
>>>>> -	if (direct_io && allow_outplace_dio(inode, iocb, from))
>>>>> -		return 0;
>>>>
>>>> Since for LFS mode, all DIOs can end up out of place, I think that it
>>>> may be better to change allow_outplace_dio() to always return true in
>>>> the case of LFS mode. So may be something like:
>>>>
>>>> static inline int allow_outplace_dio(struct inode *inode,
>>>> 			struct kiocb *iocb, struct iov_iter *iter)
>>>> {
>>>> 	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>>>> 	int rw = iov_iter_rw(iter);
>>>>
>>>> 	return test_opt(sbi, LFS) ||
>>>> 	 	(rw == WRITE && !block_unaligned_IO(inode, iocb, iter));
>>>> }
>>>>
>>>> instead of the original:
>>>>
>>>> static inline int allow_outplace_dio(struct inode *inode,
>>>> 			struct kiocb *iocb, struct iov_iter *iter)
>>>> {
>>>> 	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>>>> 	int rw = iov_iter_rw(iter);
>>>>
>>>> 	return (test_opt(sbi, LFS) && (rw == WRITE) &&
>>>> 				!block_unaligned_IO(inode, iocb, iter));
>>>> }
>>>>
>>>> Thoughts ?
>>>>
>>>
>>> I see what you mean and it makes sense. However, the problem I am seeing
>>> occurs when allow_outplace_dio() returns true, as this is what creates
>>> the inconsistency between the write being buffered and the read being
>>> DIO.
>>
>> But if the write is switched to buffered, the DIO read should use the
>> buffered path too, no ? Since this is all happening under VFS, the
>> generic DIO read path will not ensure that the buffered writes are
>> flushed to disk before issuing the direct read, I think. So that would
>> explain your data corruption, i.e. you are reading stale data on the
>> device before the buffered writes make it to the media.
>>
> 
> As far as I can see, the read is always sent DIO, so yes, I also believe
> that we are reading stale data. This is why the corruption is not seen
> if preventing allow_outplace_dio() from sending the write to the
> buffered path.
> 
> What surprises me is that this is very easy to trigger (see commit), so
> I assume you must have seen this with SMR in the past.

We just did. Shin'Ichiro in my team finally succeeded in recreating the
problem. The cause seems to be:

bool direct_io = iocb->ki_flags & IOCB_DIRECT;

being true on entry of f2fs_preallocate_blocks() whereas
f2fs_direct_IO() forces buffered IO path for DIO on zoned devices with:

if (f2fs_force_buffered_io(inode, iocb, iter))
		return 0;

which has:

	if (f2fs_sb_has_blkzoned(sbi))
		return true;

So the top DIO code says "do buffered IOs", but lower in the write path,
the IO is still assumed to be a DIO because of the iocb flag... That's
inconsistent.

Note that for the non-zoned device LFS case, f2fs_force_buffered_io()
returns true only for unaligned write DIOs... But that will still trip
on the iocb flag test. So the proper fix is likely something like:

int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	struct f2fs_map_blocks map;
	int flag;
	int err = 0;
-	bool direct_io = iocb->ki_flags & IOCB_DIRECT;
+	bool direct_io = (iocb->ki_flags & IOCB_DIRECT) &&
+		!2fs_force_buffered_io(inode, iocb, iter);

	/* convert inline data for Direct I/O*/
	if (direct_io) {
		err = f2fs_convert_inline_inode(inode);
		if (err)
			return err;
	}

Shin'Ichiro tried this on SMR disks and the failure is gone...

Cheers.


> 
> Does it make sense to leave the LFS check out of the
> allow_outplace_dio()? Or in other words, is there a hard requirement for
> writes to take this path on a zoned device that I am not seeing?
> Something like:
> 
>    static inline int allow_outplace_dio(struct inode *inode,
>    			struct kiocb *iocb, struct iov_iter *iter)
>    {
>    	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>    	int rw = iov_iter_rw(iter);
> 
>    	return (rw == WRITE && !block_unaligned_IO(inode, iocb, iter));
>    }
> 
> Thanks,
> Javier
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] f2fs: disble physical prealloc in LSF mount
  2019-11-26  6:19         ` Damien Le Moal
@ 2019-11-26  6:20           ` Damien Le Moal
  2019-11-26  6:43             ` Javier González
  0 siblings, 1 reply; 8+ messages in thread
From: Damien Le Moal @ 2019-11-26  6:20 UTC (permalink / raw)
  To: Javier González
  Cc: jaegeuk, yuchao0, linux-f2fs-devel, linux-kernel,
	Javier González, Shinichiro Kawasaki

+ Shin'Ichiro

On 2019/11/26 15:19, Damien Le Moal wrote:
> On 2019/11/26 12:58, Javier González wrote:
>> On 26.11.2019 02:06, Damien Le Moal wrote:
>>> On 2019/11/26 4:03, Javier González wrote:
>>>> On 25.11.2019 00:48, Damien Le Moal wrote:
>>>>> On 2019/11/22 18:00, Javier González wrote:
>>>>>> From: Javier González <javier.gonz@samsung.com>
>>>>>>
>>>>>> Fix file system corruption when using LFS mount (e.g., in zoned
>>>>>> devices). Seems like the fallback into buffered I/O creates an
>>>>>> inconsistency if the application is assuming both read and write DIO. I
>>>>>> can easily reproduce a corruption with a simple RocksDB test.
>>>>>>
>>>>>> Might be that the f2fs_forced_buffered_io path brings some problems too,
>>>>>> but I have not seen other failures besides this one.
>>>>>>
>>>>>> Problem reproducible without a zoned block device, simply by forcing
>>>>>> LFS mount:
>>>>>>
>>>>>>   $ sudo mkfs.f2fs -f -m /dev/nvme0n1
>>>>>>   $ sudo mount /dev/nvme0n1 /mnt/f2fs
>>>>>>   $ sudo  /opt/rocksdb/db_bench  --benchmarks=fillseq --use_existing_db=0
>>>>>>   --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>>>>>>   --db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1
>>>>>>   --block_size=65536
>>>>>>
>>>>>> Note that the options that cause the problem are:
>>>>>>   --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>>>>>>
>>>>>> Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write")
>>>>>>
>>>>>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>>>>>> ---
>>>>>>  fs/f2fs/data.c | 3 ---
>>>>>>  1 file changed, 3 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
>>>>>> index 5755e897a5f0..b045dd6ab632 100644
>>>>>> --- a/fs/f2fs/data.c
>>>>>> +++ b/fs/f2fs/data.c
>>>>>> @@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
>>>>>>  			return err;
>>>>>>  	}
>>>>>>
>>>>>> -	if (direct_io && allow_outplace_dio(inode, iocb, from))
>>>>>> -		return 0;
>>>>>
>>>>> Since for LFS mode, all DIOs can end up out of place, I think that it
>>>>> may be better to change allow_outplace_dio() to always return true in
>>>>> the case of LFS mode. So may be something like:
>>>>>
>>>>> static inline int allow_outplace_dio(struct inode *inode,
>>>>> 			struct kiocb *iocb, struct iov_iter *iter)
>>>>> {
>>>>> 	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>>>>> 	int rw = iov_iter_rw(iter);
>>>>>
>>>>> 	return test_opt(sbi, LFS) ||
>>>>> 	 	(rw == WRITE && !block_unaligned_IO(inode, iocb, iter));
>>>>> }
>>>>>
>>>>> instead of the original:
>>>>>
>>>>> static inline int allow_outplace_dio(struct inode *inode,
>>>>> 			struct kiocb *iocb, struct iov_iter *iter)
>>>>> {
>>>>> 	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>>>>> 	int rw = iov_iter_rw(iter);
>>>>>
>>>>> 	return (test_opt(sbi, LFS) && (rw == WRITE) &&
>>>>> 				!block_unaligned_IO(inode, iocb, iter));
>>>>> }
>>>>>
>>>>> Thoughts ?
>>>>>
>>>>
>>>> I see what you mean and it makes sense. However, the problem I am seeing
>>>> occurs when allow_outplace_dio() returns true, as this is what creates
>>>> the inconsistency between the write being buffered and the read being
>>>> DIO.
>>>
>>> But if the write is switched to buffered, the DIO read should use the
>>> buffered path too, no ? Since this is all happening under VFS, the
>>> generic DIO read path will not ensure that the buffered writes are
>>> flushed to disk before issuing the direct read, I think. So that would
>>> explain your data corruption, i.e. you are reading stale data on the
>>> device before the buffered writes make it to the media.
>>>
>>
>> As far as I can see, the read is always sent DIO, so yes, I also believe
>> that we are reading stale data. This is why the corruption is not seen
>> if preventing allow_outplace_dio() from sending the write to the
>> buffered path.
>>
>> What surprises me is that this is very easy to trigger (see commit), so
>> I assume you must have seen this with SMR in the past.
> 
> We just did. Shin'Ichiro in my team finally succeeded in recreating the
> problem. The cause seems to be:
> 
> bool direct_io = iocb->ki_flags & IOCB_DIRECT;
> 
> being true on entry of f2fs_preallocate_blocks() whereas
> f2fs_direct_IO() forces buffered IO path for DIO on zoned devices with:
> 
> if (f2fs_force_buffered_io(inode, iocb, iter))
> 		return 0;
> 
> which has:
> 
> 	if (f2fs_sb_has_blkzoned(sbi))
> 		return true;
> 
> So the top DIO code says "do buffered IOs", but lower in the write path,
> the IO is still assumed to be a DIO because of the iocb flag... That's
> inconsistent.
> 
> Note that for the non-zoned device LFS case, f2fs_force_buffered_io()
> returns true only for unaligned write DIOs... But that will still trip
> on the iocb flag test. So the proper fix is likely something like:
> 
> int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
> {
> 	struct inode *inode = file_inode(iocb->ki_filp);
> 	struct f2fs_map_blocks map;
> 	int flag;
> 	int err = 0;
> -	bool direct_io = iocb->ki_flags & IOCB_DIRECT;
> +	bool direct_io = (iocb->ki_flags & IOCB_DIRECT) &&
> +		!2fs_force_buffered_io(inode, iocb, iter);
> 
> 	/* convert inline data for Direct I/O*/
> 	if (direct_io) {
> 		err = f2fs_convert_inline_inode(inode);
> 		if (err)
> 			return err;
> 	}
> 
> Shin'Ichiro tried this on SMR disks and the failure is gone...
> 
> Cheers.
> 
> 
>>
>> Does it make sense to leave the LFS check out of the
>> allow_outplace_dio()? Or in other words, is there a hard requirement for
>> writes to take this path on a zoned device that I am not seeing?
>> Something like:
>>
>>    static inline int allow_outplace_dio(struct inode *inode,
>>    			struct kiocb *iocb, struct iov_iter *iter)
>>    {
>>    	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>>    	int rw = iov_iter_rw(iter);
>>
>>    	return (rw == WRITE && !block_unaligned_IO(inode, iocb, iter));
>>    }
>>
>> Thanks,
>> Javier
>>
> 
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] f2fs: disble physical prealloc in LSF mount
  2019-11-26  6:20           ` Damien Le Moal
@ 2019-11-26  6:43             ` Javier González
  0 siblings, 0 replies; 8+ messages in thread
From: Javier González @ 2019-11-26  6:43 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: jaegeuk, yuchao0, linux-f2fs-devel, linux-kernel,
	Javier González, Shinichiro Kawasaki

On 26.11.2019 06:20, Damien Le Moal wrote:
>+ Shin'Ichiro
>
>On 2019/11/26 15:19, Damien Le Moal wrote:
>> On 2019/11/26 12:58, Javier González wrote:
>>> On 26.11.2019 02:06, Damien Le Moal wrote:
>>>> On 2019/11/26 4:03, Javier González wrote:
>>>>> On 25.11.2019 00:48, Damien Le Moal wrote:
>>>>>> On 2019/11/22 18:00, Javier González wrote:
>>>>>>> From: Javier González <javier.gonz@samsung.com>
>>>>>>>
>>>>>>> Fix file system corruption when using LFS mount (e.g., in zoned
>>>>>>> devices). Seems like the fallback into buffered I/O creates an
>>>>>>> inconsistency if the application is assuming both read and write DIO. I
>>>>>>> can easily reproduce a corruption with a simple RocksDB test.
>>>>>>>
>>>>>>> Might be that the f2fs_forced_buffered_io path brings some problems too,
>>>>>>> but I have not seen other failures besides this one.
>>>>>>>
>>>>>>> Problem reproducible without a zoned block device, simply by forcing
>>>>>>> LFS mount:
>>>>>>>
>>>>>>>   $ sudo mkfs.f2fs -f -m /dev/nvme0n1
>>>>>>>   $ sudo mount /dev/nvme0n1 /mnt/f2fs
>>>>>>>   $ sudo  /opt/rocksdb/db_bench  --benchmarks=fillseq --use_existing_db=0
>>>>>>>   --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>>>>>>>   --db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1
>>>>>>>   --block_size=65536
>>>>>>>
>>>>>>> Note that the options that cause the problem are:
>>>>>>>   --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>>>>>>>
>>>>>>> Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write")
>>>>>>>
>>>>>>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>>>>>>> ---
>>>>>>>  fs/f2fs/data.c | 3 ---
>>>>>>>  1 file changed, 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
>>>>>>> index 5755e897a5f0..b045dd6ab632 100644
>>>>>>> --- a/fs/f2fs/data.c
>>>>>>> +++ b/fs/f2fs/data.c
>>>>>>> @@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
>>>>>>>  			return err;
>>>>>>>  	}
>>>>>>>
>>>>>>> -	if (direct_io && allow_outplace_dio(inode, iocb, from))
>>>>>>> -		return 0;
>>>>>>
>>>>>> Since for LFS mode, all DIOs can end up out of place, I think that it
>>>>>> may be better to change allow_outplace_dio() to always return true in
>>>>>> the case of LFS mode. So may be something like:
>>>>>>
>>>>>> static inline int allow_outplace_dio(struct inode *inode,
>>>>>> 			struct kiocb *iocb, struct iov_iter *iter)
>>>>>> {
>>>>>> 	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>>>>>> 	int rw = iov_iter_rw(iter);
>>>>>>
>>>>>> 	return test_opt(sbi, LFS) ||
>>>>>> 	 	(rw == WRITE && !block_unaligned_IO(inode, iocb, iter));
>>>>>> }
>>>>>>
>>>>>> instead of the original:
>>>>>>
>>>>>> static inline int allow_outplace_dio(struct inode *inode,
>>>>>> 			struct kiocb *iocb, struct iov_iter *iter)
>>>>>> {
>>>>>> 	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>>>>>> 	int rw = iov_iter_rw(iter);
>>>>>>
>>>>>> 	return (test_opt(sbi, LFS) && (rw == WRITE) &&
>>>>>> 				!block_unaligned_IO(inode, iocb, iter));
>>>>>> }
>>>>>>
>>>>>> Thoughts ?
>>>>>>
>>>>>
>>>>> I see what you mean and it makes sense. However, the problem I am seeing
>>>>> occurs when allow_outplace_dio() returns true, as this is what creates
>>>>> the inconsistency between the write being buffered and the read being
>>>>> DIO.
>>>>
>>>> But if the write is switched to buffered, the DIO read should use the
>>>> buffered path too, no ? Since this is all happening under VFS, the
>>>> generic DIO read path will not ensure that the buffered writes are
>>>> flushed to disk before issuing the direct read, I think. So that would
>>>> explain your data corruption, i.e. you are reading stale data on the
>>>> device before the buffered writes make it to the media.
>>>>
>>>
>>> As far as I can see, the read is always sent DIO, so yes, I also believe
>>> that we are reading stale data. This is why the corruption is not seen
>>> if preventing allow_outplace_dio() from sending the write to the
>>> buffered path.
>>>
>>> What surprises me is that this is very easy to trigger (see commit), so
>>> I assume you must have seen this with SMR in the past.
>>
>> We just did. Shin'Ichiro in my team finally succeeded in recreating the
>> problem. The cause seems to be:
>>
>> bool direct_io = iocb->ki_flags & IOCB_DIRECT;
>>
>> being true on entry of f2fs_preallocate_blocks() whereas
>> f2fs_direct_IO() forces buffered IO path for DIO on zoned devices with:
>>
>> if (f2fs_force_buffered_io(inode, iocb, iter))
>> 		return 0;
>>
>> which has:
>>
>> 	if (f2fs_sb_has_blkzoned(sbi))
>> 		return true;
>>
>> So the top DIO code says "do buffered IOs", but lower in the write path,
>> the IO is still assumed to be a DIO because of the iocb flag... That's
>> inconsistent.
>>
>> Note that for the non-zoned device LFS case, f2fs_force_buffered_io()
>> returns true only for unaligned write DIOs... But that will still trip
>> on the iocb flag test. So the proper fix is likely something like:
>>
>> int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
>> {
>> 	struct inode *inode = file_inode(iocb->ki_filp);
>> 	struct f2fs_map_blocks map;
>> 	int flag;
>> 	int err = 0;
>> -	bool direct_io = iocb->ki_flags & IOCB_DIRECT;
>> +	bool direct_io = (iocb->ki_flags & IOCB_DIRECT) &&
>> +		!2fs_force_buffered_io(inode, iocb, iter);
>>
>> 	/* convert inline data for Direct I/O*/
>> 	if (direct_io) {
>> 		err = f2fs_convert_inline_inode(inode);
>> 		if (err)
>> 			return err;
>> 	}
>>
>> Shin'Ichiro tried this on SMR disks and the failure is gone...
>>
>> Cheers.
>>

Yes! This is it. I originally though that the problem was on
f2fs_force_buffered_io(), but could not hit the problem there. Thanks
for the analysis; it makes sense now.

Just tested your patch on our drives and the problem is gone too. Guess
you can send a new patch an ignore this one. You can set my reviewed-by
on it.

Thanks Damien!
Javier

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2019-11-26  6:43 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-22  8:59 [PATCH] f2fs: disble physical prealloc in LSF mount Javier González
2019-11-25  0:48 ` Damien Le Moal
2019-11-25 19:03   ` Javier González
2019-11-26  2:06     ` Damien Le Moal
2019-11-26  3:57       ` Javier González
2019-11-26  6:19         ` Damien Le Moal
2019-11-26  6:20           ` Damien Le Moal
2019-11-26  6:43             ` Javier González

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).