* [f2fs-dev] [PATCH] f2fs: disble physical prealloc in LSF mount @ 2019-11-22 8:59 Javier González 2019-11-25 0:48 ` Damien Le Moal 0 siblings, 1 reply; 8+ messages in thread From: Javier González @ 2019-11-22 8:59 UTC (permalink / raw) To: jaegeuk, yuchao0 Cc: Javier González, damien.lemoal, linux-kernel, linux-f2fs-devel From: Javier González <javier.gonz@samsung.com> Fix file system corruption when using LFS mount (e.g., in zoned devices). Seems like the fallback into buffered I/O creates an inconsistency if the application is assuming both read and write DIO. I can easily reproduce a corruption with a simple RocksDB test. Might be that the f2fs_forced_buffered_io path brings some problems too, but I have not seen other failures besides this one. Problem reproducible without a zoned block device, simply by forcing LFS mount: $ sudo mkfs.f2fs -f -m /dev/nvme0n1 $ sudo mount /dev/nvme0n1 /mnt/f2fs $ sudo /opt/rocksdb/db_bench --benchmarks=fillseq --use_existing_db=0 --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true --db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1 --block_size=65536 Note that the options that cause the problem are: --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write") Signed-off-by: Javier González <javier.gonz@samsung.com> --- fs/f2fs/data.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c index 5755e897a5f0..b045dd6ab632 100644 --- a/fs/f2fs/data.c +++ b/fs/f2fs/data.c @@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from) return err; } - if (direct_io && allow_outplace_dio(inode, iocb, from)) - return 0; - if (is_inode_flag_set(inode, FI_NO_PREALLOC)) return 0; -- 2.17.1 _______________________________________________ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [f2fs-dev] [PATCH] f2fs: disble physical prealloc in LSF mount 2019-11-22 8:59 [f2fs-dev] [PATCH] f2fs: disble physical prealloc in LSF mount Javier González @ 2019-11-25 0:48 ` Damien Le Moal 2019-11-25 19:03 ` Javier González 0 siblings, 1 reply; 8+ messages in thread From: Damien Le Moal @ 2019-11-25 0:48 UTC (permalink / raw) To: Javier González, jaegeuk, yuchao0 Cc: Javier González, linux-kernel, linux-f2fs-devel On 2019/11/22 18:00, Javier González wrote: > From: Javier González <javier.gonz@samsung.com> > > Fix file system corruption when using LFS mount (e.g., in zoned > devices). Seems like the fallback into buffered I/O creates an > inconsistency if the application is assuming both read and write DIO. I > can easily reproduce a corruption with a simple RocksDB test. > > Might be that the f2fs_forced_buffered_io path brings some problems too, > but I have not seen other failures besides this one. > > Problem reproducible without a zoned block device, simply by forcing > LFS mount: > > $ sudo mkfs.f2fs -f -m /dev/nvme0n1 > $ sudo mount /dev/nvme0n1 /mnt/f2fs > $ sudo /opt/rocksdb/db_bench --benchmarks=fillseq --use_existing_db=0 > --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true > --db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1 > --block_size=65536 > > Note that the options that cause the problem are: > --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true > > Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write") > > Signed-off-by: Javier González <javier.gonz@samsung.com> > --- > fs/f2fs/data.c | 3 --- > 1 file changed, 3 deletions(-) > > diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c > index 5755e897a5f0..b045dd6ab632 100644 > --- a/fs/f2fs/data.c > +++ b/fs/f2fs/data.c > @@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from) > return err; > } > > - if (direct_io && allow_outplace_dio(inode, iocb, from)) > - return 0; Since for LFS mode, all DIOs can end up out of place, I think that it may be better to change allow_outplace_dio() to always return true in the case of LFS mode. So may be something like: static inline int allow_outplace_dio(struct inode *inode, struct kiocb *iocb, struct iov_iter *iter) { struct f2fs_sb_info *sbi = F2FS_I_SB(inode); int rw = iov_iter_rw(iter); return test_opt(sbi, LFS) || (rw == WRITE && !block_unaligned_IO(inode, iocb, iter)); } instead of the original: static inline int allow_outplace_dio(struct inode *inode, struct kiocb *iocb, struct iov_iter *iter) { struct f2fs_sb_info *sbi = F2FS_I_SB(inode); int rw = iov_iter_rw(iter); return (test_opt(sbi, LFS) && (rw == WRITE) && !block_unaligned_IO(inode, iocb, iter)); } Thoughts ? > - > if (is_inode_flag_set(inode, FI_NO_PREALLOC)) > return 0; > > -- Damien Le Moal Western Digital Research _______________________________________________ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [f2fs-dev] [PATCH] f2fs: disble physical prealloc in LSF mount 2019-11-25 0:48 ` Damien Le Moal @ 2019-11-25 19:03 ` Javier González 2019-11-26 2:06 ` Damien Le Moal 0 siblings, 1 reply; 8+ messages in thread From: Javier González @ 2019-11-25 19:03 UTC (permalink / raw) To: Damien Le Moal Cc: jaegeuk, linux-kernel, Javier González, linux-f2fs-devel On 25.11.2019 00:48, Damien Le Moal wrote: >On 2019/11/22 18:00, Javier González wrote: >> From: Javier González <javier.gonz@samsung.com> >> >> Fix file system corruption when using LFS mount (e.g., in zoned >> devices). Seems like the fallback into buffered I/O creates an >> inconsistency if the application is assuming both read and write DIO. I >> can easily reproduce a corruption with a simple RocksDB test. >> >> Might be that the f2fs_forced_buffered_io path brings some problems too, >> but I have not seen other failures besides this one. >> >> Problem reproducible without a zoned block device, simply by forcing >> LFS mount: >> >> $ sudo mkfs.f2fs -f -m /dev/nvme0n1 >> $ sudo mount /dev/nvme0n1 /mnt/f2fs >> $ sudo /opt/rocksdb/db_bench --benchmarks=fillseq --use_existing_db=0 >> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true >> --db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1 >> --block_size=65536 >> >> Note that the options that cause the problem are: >> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true >> >> Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write") >> >> Signed-off-by: Javier González <javier.gonz@samsung.com> >> --- >> fs/f2fs/data.c | 3 --- >> 1 file changed, 3 deletions(-) >> >> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c >> index 5755e897a5f0..b045dd6ab632 100644 >> --- a/fs/f2fs/data.c >> +++ b/fs/f2fs/data.c >> @@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from) >> return err; >> } >> >> - if (direct_io && allow_outplace_dio(inode, iocb, from)) >> - return 0; > >Since for LFS mode, all DIOs can end up out of place, I think that it >may be better to change allow_outplace_dio() to always return true in >the case of LFS mode. So may be something like: > >static inline int allow_outplace_dio(struct inode *inode, > struct kiocb *iocb, struct iov_iter *iter) >{ > struct f2fs_sb_info *sbi = F2FS_I_SB(inode); > int rw = iov_iter_rw(iter); > > return test_opt(sbi, LFS) || > (rw == WRITE && !block_unaligned_IO(inode, iocb, iter)); >} > >instead of the original: > >static inline int allow_outplace_dio(struct inode *inode, > struct kiocb *iocb, struct iov_iter *iter) >{ > struct f2fs_sb_info *sbi = F2FS_I_SB(inode); > int rw = iov_iter_rw(iter); > > return (test_opt(sbi, LFS) && (rw == WRITE) && > !block_unaligned_IO(inode, iocb, iter)); >} > >Thoughts ? > I see what you mean and it makes sense. However, the problem I am seeing occurs when allow_outplace_dio() returns true, as this is what creates the inconsistency between the write being buffered and the read being DIO. I did test the code you propose and, as expected, it still triggered the corruption. >> - >> if (is_inode_flag_set(inode, FI_NO_PREALLOC)) >> return 0; >> >> > > >-- >Damien Le Moal >Western Digital Research Thanks, Javier _______________________________________________ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [f2fs-dev] [PATCH] f2fs: disble physical prealloc in LSF mount 2019-11-25 19:03 ` Javier González @ 2019-11-26 2:06 ` Damien Le Moal 2019-11-26 3:57 ` Javier González 0 siblings, 1 reply; 8+ messages in thread From: Damien Le Moal @ 2019-11-26 2:06 UTC (permalink / raw) To: Javier González Cc: jaegeuk, linux-kernel, Javier González, linux-f2fs-devel On 2019/11/26 4:03, Javier González wrote: > On 25.11.2019 00:48, Damien Le Moal wrote: >> On 2019/11/22 18:00, Javier González wrote: >>> From: Javier González <javier.gonz@samsung.com> >>> >>> Fix file system corruption when using LFS mount (e.g., in zoned >>> devices). Seems like the fallback into buffered I/O creates an >>> inconsistency if the application is assuming both read and write DIO. I >>> can easily reproduce a corruption with a simple RocksDB test. >>> >>> Might be that the f2fs_forced_buffered_io path brings some problems too, >>> but I have not seen other failures besides this one. >>> >>> Problem reproducible without a zoned block device, simply by forcing >>> LFS mount: >>> >>> $ sudo mkfs.f2fs -f -m /dev/nvme0n1 >>> $ sudo mount /dev/nvme0n1 /mnt/f2fs >>> $ sudo /opt/rocksdb/db_bench --benchmarks=fillseq --use_existing_db=0 >>> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true >>> --db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1 >>> --block_size=65536 >>> >>> Note that the options that cause the problem are: >>> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true >>> >>> Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write") >>> >>> Signed-off-by: Javier González <javier.gonz@samsung.com> >>> --- >>> fs/f2fs/data.c | 3 --- >>> 1 file changed, 3 deletions(-) >>> >>> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c >>> index 5755e897a5f0..b045dd6ab632 100644 >>> --- a/fs/f2fs/data.c >>> +++ b/fs/f2fs/data.c >>> @@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from) >>> return err; >>> } >>> >>> - if (direct_io && allow_outplace_dio(inode, iocb, from)) >>> - return 0; >> >> Since for LFS mode, all DIOs can end up out of place, I think that it >> may be better to change allow_outplace_dio() to always return true in >> the case of LFS mode. So may be something like: >> >> static inline int allow_outplace_dio(struct inode *inode, >> struct kiocb *iocb, struct iov_iter *iter) >> { >> struct f2fs_sb_info *sbi = F2FS_I_SB(inode); >> int rw = iov_iter_rw(iter); >> >> return test_opt(sbi, LFS) || >> (rw == WRITE && !block_unaligned_IO(inode, iocb, iter)); >> } >> >> instead of the original: >> >> static inline int allow_outplace_dio(struct inode *inode, >> struct kiocb *iocb, struct iov_iter *iter) >> { >> struct f2fs_sb_info *sbi = F2FS_I_SB(inode); >> int rw = iov_iter_rw(iter); >> >> return (test_opt(sbi, LFS) && (rw == WRITE) && >> !block_unaligned_IO(inode, iocb, iter)); >> } >> >> Thoughts ? >> > > I see what you mean and it makes sense. However, the problem I am seeing > occurs when allow_outplace_dio() returns true, as this is what creates > the inconsistency between the write being buffered and the read being > DIO. But if the write is switched to buffered, the DIO read should use the buffered path too, no ? Since this is all happening under VFS, the generic DIO read path will not ensure that the buffered writes are flushed to disk before issuing the direct read, I think. So that would explain your data corruption, i.e. you are reading stale data on the device before the buffered writes make it to the media. > > I did test the code you propose and, as expected, it still triggered the > corruption. > >>> - >>> if (is_inode_flag_set(inode, FI_NO_PREALLOC)) >>> return 0; >>> >>> >> >> >> -- >> Damien Le Moal >> Western Digital Research > > Thanks, > Javier > -- Damien Le Moal Western Digital Research _______________________________________________ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [f2fs-dev] [PATCH] f2fs: disble physical prealloc in LSF mount 2019-11-26 2:06 ` Damien Le Moal @ 2019-11-26 3:57 ` Javier González 2019-11-26 6:19 ` Damien Le Moal 0 siblings, 1 reply; 8+ messages in thread From: Javier González @ 2019-11-26 3:57 UTC (permalink / raw) To: Damien Le Moal Cc: jaegeuk, linux-kernel, Javier González, linux-f2fs-devel On 26.11.2019 02:06, Damien Le Moal wrote: >On 2019/11/26 4:03, Javier González wrote: >> On 25.11.2019 00:48, Damien Le Moal wrote: >>> On 2019/11/22 18:00, Javier González wrote: >>>> From: Javier González <javier.gonz@samsung.com> >>>> >>>> Fix file system corruption when using LFS mount (e.g., in zoned >>>> devices). Seems like the fallback into buffered I/O creates an >>>> inconsistency if the application is assuming both read and write DIO. I >>>> can easily reproduce a corruption with a simple RocksDB test. >>>> >>>> Might be that the f2fs_forced_buffered_io path brings some problems too, >>>> but I have not seen other failures besides this one. >>>> >>>> Problem reproducible without a zoned block device, simply by forcing >>>> LFS mount: >>>> >>>> $ sudo mkfs.f2fs -f -m /dev/nvme0n1 >>>> $ sudo mount /dev/nvme0n1 /mnt/f2fs >>>> $ sudo /opt/rocksdb/db_bench --benchmarks=fillseq --use_existing_db=0 >>>> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true >>>> --db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1 >>>> --block_size=65536 >>>> >>>> Note that the options that cause the problem are: >>>> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true >>>> >>>> Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write") >>>> >>>> Signed-off-by: Javier González <javier.gonz@samsung.com> >>>> --- >>>> fs/f2fs/data.c | 3 --- >>>> 1 file changed, 3 deletions(-) >>>> >>>> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c >>>> index 5755e897a5f0..b045dd6ab632 100644 >>>> --- a/fs/f2fs/data.c >>>> +++ b/fs/f2fs/data.c >>>> @@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from) >>>> return err; >>>> } >>>> >>>> - if (direct_io && allow_outplace_dio(inode, iocb, from)) >>>> - return 0; >>> >>> Since for LFS mode, all DIOs can end up out of place, I think that it >>> may be better to change allow_outplace_dio() to always return true in >>> the case of LFS mode. So may be something like: >>> >>> static inline int allow_outplace_dio(struct inode *inode, >>> struct kiocb *iocb, struct iov_iter *iter) >>> { >>> struct f2fs_sb_info *sbi = F2FS_I_SB(inode); >>> int rw = iov_iter_rw(iter); >>> >>> return test_opt(sbi, LFS) || >>> (rw == WRITE && !block_unaligned_IO(inode, iocb, iter)); >>> } >>> >>> instead of the original: >>> >>> static inline int allow_outplace_dio(struct inode *inode, >>> struct kiocb *iocb, struct iov_iter *iter) >>> { >>> struct f2fs_sb_info *sbi = F2FS_I_SB(inode); >>> int rw = iov_iter_rw(iter); >>> >>> return (test_opt(sbi, LFS) && (rw == WRITE) && >>> !block_unaligned_IO(inode, iocb, iter)); >>> } >>> >>> Thoughts ? >>> >> >> I see what you mean and it makes sense. However, the problem I am seeing >> occurs when allow_outplace_dio() returns true, as this is what creates >> the inconsistency between the write being buffered and the read being >> DIO. > >But if the write is switched to buffered, the DIO read should use the >buffered path too, no ? Since this is all happening under VFS, the >generic DIO read path will not ensure that the buffered writes are >flushed to disk before issuing the direct read, I think. So that would >explain your data corruption, i.e. you are reading stale data on the >device before the buffered writes make it to the media. > As far as I can see, the read is always sent DIO, so yes, I also believe that we are reading stale data. This is why the corruption is not seen if preventing allow_outplace_dio() from sending the write to the buffered path. What surprises me is that this is very easy to trigger (see commit), so I assume you must have seen this with SMR in the past. Does it make sense to leave the LFS check out of the allow_outplace_dio()? Or in other words, is there a hard requirement for writes to take this path on a zoned device that I am not seeing? Something like: static inline int allow_outplace_dio(struct inode *inode, struct kiocb *iocb, struct iov_iter *iter) { struct f2fs_sb_info *sbi = F2FS_I_SB(inode); int rw = iov_iter_rw(iter); return (rw == WRITE && !block_unaligned_IO(inode, iocb, iter)); } Thanks, Javier _______________________________________________ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [f2fs-dev] [PATCH] f2fs: disble physical prealloc in LSF mount 2019-11-26 3:57 ` Javier González @ 2019-11-26 6:19 ` Damien Le Moal 2019-11-26 6:20 ` Damien Le Moal 0 siblings, 1 reply; 8+ messages in thread From: Damien Le Moal @ 2019-11-26 6:19 UTC (permalink / raw) To: Javier González Cc: jaegeuk, linux-kernel, Javier González, linux-f2fs-devel On 2019/11/26 12:58, Javier González wrote: > On 26.11.2019 02:06, Damien Le Moal wrote: >> On 2019/11/26 4:03, Javier González wrote: >>> On 25.11.2019 00:48, Damien Le Moal wrote: >>>> On 2019/11/22 18:00, Javier González wrote: >>>>> From: Javier González <javier.gonz@samsung.com> >>>>> >>>>> Fix file system corruption when using LFS mount (e.g., in zoned >>>>> devices). Seems like the fallback into buffered I/O creates an >>>>> inconsistency if the application is assuming both read and write DIO. I >>>>> can easily reproduce a corruption with a simple RocksDB test. >>>>> >>>>> Might be that the f2fs_forced_buffered_io path brings some problems too, >>>>> but I have not seen other failures besides this one. >>>>> >>>>> Problem reproducible without a zoned block device, simply by forcing >>>>> LFS mount: >>>>> >>>>> $ sudo mkfs.f2fs -f -m /dev/nvme0n1 >>>>> $ sudo mount /dev/nvme0n1 /mnt/f2fs >>>>> $ sudo /opt/rocksdb/db_bench --benchmarks=fillseq --use_existing_db=0 >>>>> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true >>>>> --db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1 >>>>> --block_size=65536 >>>>> >>>>> Note that the options that cause the problem are: >>>>> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true >>>>> >>>>> Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write") >>>>> >>>>> Signed-off-by: Javier González <javier.gonz@samsung.com> >>>>> --- >>>>> fs/f2fs/data.c | 3 --- >>>>> 1 file changed, 3 deletions(-) >>>>> >>>>> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c >>>>> index 5755e897a5f0..b045dd6ab632 100644 >>>>> --- a/fs/f2fs/data.c >>>>> +++ b/fs/f2fs/data.c >>>>> @@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from) >>>>> return err; >>>>> } >>>>> >>>>> - if (direct_io && allow_outplace_dio(inode, iocb, from)) >>>>> - return 0; >>>> >>>> Since for LFS mode, all DIOs can end up out of place, I think that it >>>> may be better to change allow_outplace_dio() to always return true in >>>> the case of LFS mode. So may be something like: >>>> >>>> static inline int allow_outplace_dio(struct inode *inode, >>>> struct kiocb *iocb, struct iov_iter *iter) >>>> { >>>> struct f2fs_sb_info *sbi = F2FS_I_SB(inode); >>>> int rw = iov_iter_rw(iter); >>>> >>>> return test_opt(sbi, LFS) || >>>> (rw == WRITE && !block_unaligned_IO(inode, iocb, iter)); >>>> } >>>> >>>> instead of the original: >>>> >>>> static inline int allow_outplace_dio(struct inode *inode, >>>> struct kiocb *iocb, struct iov_iter *iter) >>>> { >>>> struct f2fs_sb_info *sbi = F2FS_I_SB(inode); >>>> int rw = iov_iter_rw(iter); >>>> >>>> return (test_opt(sbi, LFS) && (rw == WRITE) && >>>> !block_unaligned_IO(inode, iocb, iter)); >>>> } >>>> >>>> Thoughts ? >>>> >>> >>> I see what you mean and it makes sense. However, the problem I am seeing >>> occurs when allow_outplace_dio() returns true, as this is what creates >>> the inconsistency between the write being buffered and the read being >>> DIO. >> >> But if the write is switched to buffered, the DIO read should use the >> buffered path too, no ? Since this is all happening under VFS, the >> generic DIO read path will not ensure that the buffered writes are >> flushed to disk before issuing the direct read, I think. So that would >> explain your data corruption, i.e. you are reading stale data on the >> device before the buffered writes make it to the media. >> > > As far as I can see, the read is always sent DIO, so yes, I also believe > that we are reading stale data. This is why the corruption is not seen > if preventing allow_outplace_dio() from sending the write to the > buffered path. > > What surprises me is that this is very easy to trigger (see commit), so > I assume you must have seen this with SMR in the past. We just did. Shin'Ichiro in my team finally succeeded in recreating the problem. The cause seems to be: bool direct_io = iocb->ki_flags & IOCB_DIRECT; being true on entry of f2fs_preallocate_blocks() whereas f2fs_direct_IO() forces buffered IO path for DIO on zoned devices with: if (f2fs_force_buffered_io(inode, iocb, iter)) return 0; which has: if (f2fs_sb_has_blkzoned(sbi)) return true; So the top DIO code says "do buffered IOs", but lower in the write path, the IO is still assumed to be a DIO because of the iocb flag... That's inconsistent. Note that for the non-zoned device LFS case, f2fs_force_buffered_io() returns true only for unaligned write DIOs... But that will still trip on the iocb flag test. So the proper fix is likely something like: int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from) { struct inode *inode = file_inode(iocb->ki_filp); struct f2fs_map_blocks map; int flag; int err = 0; - bool direct_io = iocb->ki_flags & IOCB_DIRECT; + bool direct_io = (iocb->ki_flags & IOCB_DIRECT) && + !2fs_force_buffered_io(inode, iocb, iter); /* convert inline data for Direct I/O*/ if (direct_io) { err = f2fs_convert_inline_inode(inode); if (err) return err; } Shin'Ichiro tried this on SMR disks and the failure is gone... Cheers. > > Does it make sense to leave the LFS check out of the > allow_outplace_dio()? Or in other words, is there a hard requirement for > writes to take this path on a zoned device that I am not seeing? > Something like: > > static inline int allow_outplace_dio(struct inode *inode, > struct kiocb *iocb, struct iov_iter *iter) > { > struct f2fs_sb_info *sbi = F2FS_I_SB(inode); > int rw = iov_iter_rw(iter); > > return (rw == WRITE && !block_unaligned_IO(inode, iocb, iter)); > } > > Thanks, > Javier > -- Damien Le Moal Western Digital Research _______________________________________________ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [f2fs-dev] [PATCH] f2fs: disble physical prealloc in LSF mount 2019-11-26 6:19 ` Damien Le Moal @ 2019-11-26 6:20 ` Damien Le Moal 2019-11-26 6:43 ` Javier González 0 siblings, 1 reply; 8+ messages in thread From: Damien Le Moal @ 2019-11-26 6:20 UTC (permalink / raw) To: Javier González Cc: linux-kernel, linux-f2fs-devel, Javier González, jaegeuk + Shin'Ichiro On 2019/11/26 15:19, Damien Le Moal wrote: > On 2019/11/26 12:58, Javier González wrote: >> On 26.11.2019 02:06, Damien Le Moal wrote: >>> On 2019/11/26 4:03, Javier González wrote: >>>> On 25.11.2019 00:48, Damien Le Moal wrote: >>>>> On 2019/11/22 18:00, Javier González wrote: >>>>>> From: Javier González <javier.gonz@samsung.com> >>>>>> >>>>>> Fix file system corruption when using LFS mount (e.g., in zoned >>>>>> devices). Seems like the fallback into buffered I/O creates an >>>>>> inconsistency if the application is assuming both read and write DIO. I >>>>>> can easily reproduce a corruption with a simple RocksDB test. >>>>>> >>>>>> Might be that the f2fs_forced_buffered_io path brings some problems too, >>>>>> but I have not seen other failures besides this one. >>>>>> >>>>>> Problem reproducible without a zoned block device, simply by forcing >>>>>> LFS mount: >>>>>> >>>>>> $ sudo mkfs.f2fs -f -m /dev/nvme0n1 >>>>>> $ sudo mount /dev/nvme0n1 /mnt/f2fs >>>>>> $ sudo /opt/rocksdb/db_bench --benchmarks=fillseq --use_existing_db=0 >>>>>> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true >>>>>> --db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1 >>>>>> --block_size=65536 >>>>>> >>>>>> Note that the options that cause the problem are: >>>>>> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true >>>>>> >>>>>> Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write") >>>>>> >>>>>> Signed-off-by: Javier González <javier.gonz@samsung.com> >>>>>> --- >>>>>> fs/f2fs/data.c | 3 --- >>>>>> 1 file changed, 3 deletions(-) >>>>>> >>>>>> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c >>>>>> index 5755e897a5f0..b045dd6ab632 100644 >>>>>> --- a/fs/f2fs/data.c >>>>>> +++ b/fs/f2fs/data.c >>>>>> @@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from) >>>>>> return err; >>>>>> } >>>>>> >>>>>> - if (direct_io && allow_outplace_dio(inode, iocb, from)) >>>>>> - return 0; >>>>> >>>>> Since for LFS mode, all DIOs can end up out of place, I think that it >>>>> may be better to change allow_outplace_dio() to always return true in >>>>> the case of LFS mode. So may be something like: >>>>> >>>>> static inline int allow_outplace_dio(struct inode *inode, >>>>> struct kiocb *iocb, struct iov_iter *iter) >>>>> { >>>>> struct f2fs_sb_info *sbi = F2FS_I_SB(inode); >>>>> int rw = iov_iter_rw(iter); >>>>> >>>>> return test_opt(sbi, LFS) || >>>>> (rw == WRITE && !block_unaligned_IO(inode, iocb, iter)); >>>>> } >>>>> >>>>> instead of the original: >>>>> >>>>> static inline int allow_outplace_dio(struct inode *inode, >>>>> struct kiocb *iocb, struct iov_iter *iter) >>>>> { >>>>> struct f2fs_sb_info *sbi = F2FS_I_SB(inode); >>>>> int rw = iov_iter_rw(iter); >>>>> >>>>> return (test_opt(sbi, LFS) && (rw == WRITE) && >>>>> !block_unaligned_IO(inode, iocb, iter)); >>>>> } >>>>> >>>>> Thoughts ? >>>>> >>>> >>>> I see what you mean and it makes sense. However, the problem I am seeing >>>> occurs when allow_outplace_dio() returns true, as this is what creates >>>> the inconsistency between the write being buffered and the read being >>>> DIO. >>> >>> But if the write is switched to buffered, the DIO read should use the >>> buffered path too, no ? Since this is all happening under VFS, the >>> generic DIO read path will not ensure that the buffered writes are >>> flushed to disk before issuing the direct read, I think. So that would >>> explain your data corruption, i.e. you are reading stale data on the >>> device before the buffered writes make it to the media. >>> >> >> As far as I can see, the read is always sent DIO, so yes, I also believe >> that we are reading stale data. This is why the corruption is not seen >> if preventing allow_outplace_dio() from sending the write to the >> buffered path. >> >> What surprises me is that this is very easy to trigger (see commit), so >> I assume you must have seen this with SMR in the past. > > We just did. Shin'Ichiro in my team finally succeeded in recreating the > problem. The cause seems to be: > > bool direct_io = iocb->ki_flags & IOCB_DIRECT; > > being true on entry of f2fs_preallocate_blocks() whereas > f2fs_direct_IO() forces buffered IO path for DIO on zoned devices with: > > if (f2fs_force_buffered_io(inode, iocb, iter)) > return 0; > > which has: > > if (f2fs_sb_has_blkzoned(sbi)) > return true; > > So the top DIO code says "do buffered IOs", but lower in the write path, > the IO is still assumed to be a DIO because of the iocb flag... That's > inconsistent. > > Note that for the non-zoned device LFS case, f2fs_force_buffered_io() > returns true only for unaligned write DIOs... But that will still trip > on the iocb flag test. So the proper fix is likely something like: > > int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from) > { > struct inode *inode = file_inode(iocb->ki_filp); > struct f2fs_map_blocks map; > int flag; > int err = 0; > - bool direct_io = iocb->ki_flags & IOCB_DIRECT; > + bool direct_io = (iocb->ki_flags & IOCB_DIRECT) && > + !2fs_force_buffered_io(inode, iocb, iter); > > /* convert inline data for Direct I/O*/ > if (direct_io) { > err = f2fs_convert_inline_inode(inode); > if (err) > return err; > } > > Shin'Ichiro tried this on SMR disks and the failure is gone... > > Cheers. > > >> >> Does it make sense to leave the LFS check out of the >> allow_outplace_dio()? Or in other words, is there a hard requirement for >> writes to take this path on a zoned device that I am not seeing? >> Something like: >> >> static inline int allow_outplace_dio(struct inode *inode, >> struct kiocb *iocb, struct iov_iter *iter) >> { >> struct f2fs_sb_info *sbi = F2FS_I_SB(inode); >> int rw = iov_iter_rw(iter); >> >> return (rw == WRITE && !block_unaligned_IO(inode, iocb, iter)); >> } >> >> Thanks, >> Javier >> > > -- Damien Le Moal Western Digital Research _______________________________________________ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [f2fs-dev] [PATCH] f2fs: disble physical prealloc in LSF mount 2019-11-26 6:20 ` Damien Le Moal @ 2019-11-26 6:43 ` Javier González 0 siblings, 0 replies; 8+ messages in thread From: Javier González @ 2019-11-26 6:43 UTC (permalink / raw) To: Damien Le Moal Cc: linux-kernel, linux-f2fs-devel, Javier González, jaegeuk On 26.11.2019 06:20, Damien Le Moal wrote: >+ Shin'Ichiro > >On 2019/11/26 15:19, Damien Le Moal wrote: >> On 2019/11/26 12:58, Javier González wrote: >>> On 26.11.2019 02:06, Damien Le Moal wrote: >>>> On 2019/11/26 4:03, Javier González wrote: >>>>> On 25.11.2019 00:48, Damien Le Moal wrote: >>>>>> On 2019/11/22 18:00, Javier González wrote: >>>>>>> From: Javier González <javier.gonz@samsung.com> >>>>>>> >>>>>>> Fix file system corruption when using LFS mount (e.g., in zoned >>>>>>> devices). Seems like the fallback into buffered I/O creates an >>>>>>> inconsistency if the application is assuming both read and write DIO. I >>>>>>> can easily reproduce a corruption with a simple RocksDB test. >>>>>>> >>>>>>> Might be that the f2fs_forced_buffered_io path brings some problems too, >>>>>>> but I have not seen other failures besides this one. >>>>>>> >>>>>>> Problem reproducible without a zoned block device, simply by forcing >>>>>>> LFS mount: >>>>>>> >>>>>>> $ sudo mkfs.f2fs -f -m /dev/nvme0n1 >>>>>>> $ sudo mount /dev/nvme0n1 /mnt/f2fs >>>>>>> $ sudo /opt/rocksdb/db_bench --benchmarks=fillseq --use_existing_db=0 >>>>>>> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true >>>>>>> --db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1 >>>>>>> --block_size=65536 >>>>>>> >>>>>>> Note that the options that cause the problem are: >>>>>>> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true >>>>>>> >>>>>>> Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write") >>>>>>> >>>>>>> Signed-off-by: Javier González <javier.gonz@samsung.com> >>>>>>> --- >>>>>>> fs/f2fs/data.c | 3 --- >>>>>>> 1 file changed, 3 deletions(-) >>>>>>> >>>>>>> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c >>>>>>> index 5755e897a5f0..b045dd6ab632 100644 >>>>>>> --- a/fs/f2fs/data.c >>>>>>> +++ b/fs/f2fs/data.c >>>>>>> @@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from) >>>>>>> return err; >>>>>>> } >>>>>>> >>>>>>> - if (direct_io && allow_outplace_dio(inode, iocb, from)) >>>>>>> - return 0; >>>>>> >>>>>> Since for LFS mode, all DIOs can end up out of place, I think that it >>>>>> may be better to change allow_outplace_dio() to always return true in >>>>>> the case of LFS mode. So may be something like: >>>>>> >>>>>> static inline int allow_outplace_dio(struct inode *inode, >>>>>> struct kiocb *iocb, struct iov_iter *iter) >>>>>> { >>>>>> struct f2fs_sb_info *sbi = F2FS_I_SB(inode); >>>>>> int rw = iov_iter_rw(iter); >>>>>> >>>>>> return test_opt(sbi, LFS) || >>>>>> (rw == WRITE && !block_unaligned_IO(inode, iocb, iter)); >>>>>> } >>>>>> >>>>>> instead of the original: >>>>>> >>>>>> static inline int allow_outplace_dio(struct inode *inode, >>>>>> struct kiocb *iocb, struct iov_iter *iter) >>>>>> { >>>>>> struct f2fs_sb_info *sbi = F2FS_I_SB(inode); >>>>>> int rw = iov_iter_rw(iter); >>>>>> >>>>>> return (test_opt(sbi, LFS) && (rw == WRITE) && >>>>>> !block_unaligned_IO(inode, iocb, iter)); >>>>>> } >>>>>> >>>>>> Thoughts ? >>>>>> >>>>> >>>>> I see what you mean and it makes sense. However, the problem I am seeing >>>>> occurs when allow_outplace_dio() returns true, as this is what creates >>>>> the inconsistency between the write being buffered and the read being >>>>> DIO. >>>> >>>> But if the write is switched to buffered, the DIO read should use the >>>> buffered path too, no ? Since this is all happening under VFS, the >>>> generic DIO read path will not ensure that the buffered writes are >>>> flushed to disk before issuing the direct read, I think. So that would >>>> explain your data corruption, i.e. you are reading stale data on the >>>> device before the buffered writes make it to the media. >>>> >>> >>> As far as I can see, the read is always sent DIO, so yes, I also believe >>> that we are reading stale data. This is why the corruption is not seen >>> if preventing allow_outplace_dio() from sending the write to the >>> buffered path. >>> >>> What surprises me is that this is very easy to trigger (see commit), so >>> I assume you must have seen this with SMR in the past. >> >> We just did. Shin'Ichiro in my team finally succeeded in recreating the >> problem. The cause seems to be: >> >> bool direct_io = iocb->ki_flags & IOCB_DIRECT; >> >> being true on entry of f2fs_preallocate_blocks() whereas >> f2fs_direct_IO() forces buffered IO path for DIO on zoned devices with: >> >> if (f2fs_force_buffered_io(inode, iocb, iter)) >> return 0; >> >> which has: >> >> if (f2fs_sb_has_blkzoned(sbi)) >> return true; >> >> So the top DIO code says "do buffered IOs", but lower in the write path, >> the IO is still assumed to be a DIO because of the iocb flag... That's >> inconsistent. >> >> Note that for the non-zoned device LFS case, f2fs_force_buffered_io() >> returns true only for unaligned write DIOs... But that will still trip >> on the iocb flag test. So the proper fix is likely something like: >> >> int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from) >> { >> struct inode *inode = file_inode(iocb->ki_filp); >> struct f2fs_map_blocks map; >> int flag; >> int err = 0; >> - bool direct_io = iocb->ki_flags & IOCB_DIRECT; >> + bool direct_io = (iocb->ki_flags & IOCB_DIRECT) && >> + !2fs_force_buffered_io(inode, iocb, iter); >> >> /* convert inline data for Direct I/O*/ >> if (direct_io) { >> err = f2fs_convert_inline_inode(inode); >> if (err) >> return err; >> } >> >> Shin'Ichiro tried this on SMR disks and the failure is gone... >> >> Cheers. >> Yes! This is it. I originally though that the problem was on f2fs_force_buffered_io(), but could not hit the problem there. Thanks for the analysis; it makes sense now. Just tested your patch on our drives and the problem is gone too. Guess you can send a new patch an ignore this one. You can set my reviewed-by on it. Thanks Damien! Javier _______________________________________________ Linux-f2fs-devel mailing list Linux-f2fs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2019-11-26 6:43 UTC | newest] Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-11-22 8:59 [f2fs-dev] [PATCH] f2fs: disble physical prealloc in LSF mount Javier González 2019-11-25 0:48 ` Damien Le Moal 2019-11-25 19:03 ` Javier González 2019-11-26 2:06 ` Damien Le Moal 2019-11-26 3:57 ` Javier González 2019-11-26 6:19 ` Damien Le Moal 2019-11-26 6:20 ` Damien Le Moal 2019-11-26 6:43 ` Javier González
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).