* [PATCH 0/3] ext4: Fix data corruption when extending DIO write races with buffered read @ 2021-04-12 10:23 Jan Kara 2021-04-12 10:23 ` [PATCH 1/3] iomap: Pass original DIO size to completion handler Jan Kara ` (2 more replies) 0 siblings, 3 replies; 13+ messages in thread From: Jan Kara @ 2021-04-12 10:23 UTC (permalink / raw) To: Ted Tso Cc: linux-ext4, Eric Whitney, linux-fsdevel, Darrick J . Wong, Jan Kara Hello, This patch series fixes a bug manifesting as occasional generic/418 failure when direct IO write extending a file races with buffered read. Patch 1 extends iomap DIO code to pass original IO size to ->end_dio handler so that ext4 can tell whether everything has succeeded and we don't need expensive DIO cleanup (possible truncate of leftover blocks beyond i_size). Patch 2 fixes the ext4 bug, patch 3 fixes unrelated problem I've found in ext4 DIO code. Honza ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 1/3] iomap: Pass original DIO size to completion handler 2021-04-12 10:23 [PATCH 0/3] ext4: Fix data corruption when extending DIO write races with buffered read Jan Kara @ 2021-04-12 10:23 ` Jan Kara 2021-04-12 14:07 ` kernel test robot ` (2 more replies) 2021-04-12 10:23 ` [PATCH 2/3] ext4: Fix occasional generic/418 failure Jan Kara 2021-04-12 10:23 ` [PATCH 3/3] ext4: Fix overflow in ext4_iomap_alloc() Jan Kara 2 siblings, 3 replies; 13+ messages in thread From: Jan Kara @ 2021-04-12 10:23 UTC (permalink / raw) To: Ted Tso Cc: linux-ext4, Eric Whitney, linux-fsdevel, Darrick J . Wong, Jan Kara When extending a file with direct IO write, ext4 needs to know whether IO has succeeded for the whole range that was prepared for it (the common fast path). In that case we can piggy back the orphan list update with the inode size update. In case write was actually shorter than originally intended, we must leave inode on the orphan list until we cleanup blocks beyond i_size. So provide the original IO size to the direct IO ->end_io handler. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext4/file.c | 3 ++- fs/iomap/direct-io.c | 5 ++++- fs/xfs/xfs_file.c | 1 + fs/zonefs/super.c | 3 ++- include/linux/iomap.h | 4 ++-- 5 files changed, 11 insertions(+), 5 deletions(-) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 194f5d00fa32..2505313d96b0 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -369,7 +369,8 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset, } static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, - int error, unsigned int flags) + ssize_t orig_size, int error, + unsigned int flags) { loff_t offset = iocb->ki_pos; struct inode *inode = file_inode(iocb->ki_filp); diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c index bdd0d89bbf0a..a4bbf22f69bc 100644 --- a/fs/iomap/direct-io.c +++ b/fs/iomap/direct-io.c @@ -28,6 +28,7 @@ struct iomap_dio { const struct iomap_dio_ops *dops; loff_t i_size; loff_t size; + loff_t orig_size; atomic_t ref; unsigned flags; int error; @@ -85,7 +86,8 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio) ssize_t ret = dio->error; if (dops && dops->end_io) - ret = dops->end_io(iocb, dio->size, ret, dio->flags); + ret = dops->end_io(iocb, dio->size, dio->orig_size, ret, + dio->flags); if (likely(!ret)) { ret = dio->size; @@ -473,6 +475,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, dio->iocb = iocb; atomic_set(&dio->ref, 1); dio->size = 0; + dio->orig_size = count; dio->i_size = i_size_read(inode); dio->dops = dops; dio->error = 0; diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index a007ca0711d9..ed23ed56e345 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -442,6 +442,7 @@ static int xfs_dio_write_end_io( struct kiocb *iocb, ssize_t size, + ssize_t orig_size, int error, unsigned flags) { diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c index 049e36c69ed7..e3e0ee4b8c6e 100644 --- a/fs/zonefs/super.c +++ b/fs/zonefs/super.c @@ -652,7 +652,8 @@ static loff_t zonefs_file_llseek(struct file *file, loff_t offset, int whence) } static int zonefs_file_write_dio_end_io(struct kiocb *iocb, ssize_t size, - int error, unsigned int flags) + ssize_t orig_size, int error, + unsigned int flags) { struct inode *inode = file_inode(iocb->ki_filp); struct zonefs_inode_info *zi = ZONEFS_I(inode); diff --git a/include/linux/iomap.h b/include/linux/iomap.h index d202fd2d0f91..b175641e9540 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -252,8 +252,8 @@ int iomap_writepages(struct address_space *mapping, #define IOMAP_DIO_COW (1 << 1) /* covers COW extent(s) */ struct iomap_dio_ops { - int (*end_io)(struct kiocb *iocb, ssize_t size, int error, - unsigned flags); + int (*end_io)(struct kiocb *iocb, ssize_t size, ssize_t orig_size, + int error, unsigned flags); blk_qc_t (*submit_io)(struct inode *inode, struct iomap *iomap, struct bio *bio, loff_t file_offset); }; -- 2.26.2 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 1/3] iomap: Pass original DIO size to completion handler 2021-04-12 10:23 ` [PATCH 1/3] iomap: Pass original DIO size to completion handler Jan Kara @ 2021-04-12 14:07 ` kernel test robot 2021-04-12 14:12 ` kernel test robot 2021-04-12 16:37 ` kernel test robot 2 siblings, 0 replies; 13+ messages in thread From: kernel test robot @ 2021-04-12 14:07 UTC (permalink / raw) To: Jan Kara, Ted Tso Cc: kbuild-all, clang-built-linux, linux-ext4, Eric Whitney, linux-fsdevel, Darrick J . Wong, Jan Kara [-- Attachment #1: Type: text/plain, Size: 7126 bytes --] Hi Jan, I love your patch! Yet something to improve: [auto build test ERROR on ext4/dev] [also build test ERROR on xfs-linux/for-next linus/master v5.12-rc7 next-20210409] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/Jan-Kara/ext4-Fix-data-corruption-when-extending-DIO-write-races-with-buffered-read/20210412-182524 base: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev config: riscv-randconfig-r004-20210412 (attached as .config) compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 9829f5e6b1bca9b61efc629770d28bb9014dec45) reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # install riscv cross compiling tool for clang build # apt-get install binutils-riscv64-linux-gnu # https://github.com/0day-ci/linux/commit/0d289243d061378ac42188ff5079287885575bb3 git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review Jan-Kara/ext4-Fix-data-corruption-when-extending-DIO-write-races-with-buffered-read/20210412-182524 git checkout 0d289243d061378ac42188ff5079287885575bb3 # save the attached .config to linux build tree COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=riscv If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot <lkp@intel.com> All errors (new ones prefixed by >>): >> fs/zonefs/super.c:732:49: error: too few arguments to function call, expected 5, have 4 zonefs_file_write_dio_end_io(iocb, size, ret, 0); ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^ fs/zonefs/super.c:654:12: note: 'zonefs_file_write_dio_end_io' declared here static int zonefs_file_write_dio_end_io(struct kiocb *iocb, ssize_t size, ^ >> fs/zonefs/super.c:961:14: error: incompatible function pointer types initializing 'int (*)(struct kiocb *, ssize_t, ssize_t, int, unsigned int)' (aka 'int (*)(struct kiocb *, long, long, int, unsigned int)') with an expression of type 'int (struct kiocb *, ssize_t, int, unsigned int)' (aka 'int (struct kiocb *, long, int, unsigned int)') [-Werror,-Wincompatible-function-pointer-types] .end_io = zonefs_file_read_dio_end_io, ^~~~~~~~~~~~~~~~~~~~~~~~~~~ 2 errors generated. vim +732 fs/zonefs/super.c 8dcc1a9d90c10f Damien Le Moal 2019-12-25 688 02ef12a663c7ac Johannes Thumshirn 2020-05-12 689 static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from) 02ef12a663c7ac Johannes Thumshirn 2020-05-12 690 { 02ef12a663c7ac Johannes Thumshirn 2020-05-12 691 struct inode *inode = file_inode(iocb->ki_filp); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 692 struct zonefs_inode_info *zi = ZONEFS_I(inode); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 693 struct block_device *bdev = inode->i_sb->s_bdev; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 694 unsigned int max; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 695 struct bio *bio; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 696 ssize_t size; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 697 int nr_pages; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 698 ssize_t ret; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 699 02ef12a663c7ac Johannes Thumshirn 2020-05-12 700 max = queue_max_zone_append_sectors(bdev_get_queue(bdev)); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 701 max = ALIGN_DOWN(max << SECTOR_SHIFT, inode->i_sb->s_blocksize); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 702 iov_iter_truncate(from, max); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 703 a8affc03a9b375 Christoph Hellwig 2021-03-11 704 nr_pages = iov_iter_npages(from, BIO_MAX_VECS); 89ee72376be23a Johannes Thumshirn 2020-07-16 705 if (!nr_pages) 89ee72376be23a Johannes Thumshirn 2020-07-16 706 return 0; 89ee72376be23a Johannes Thumshirn 2020-07-16 707 f91ca2a370bec5 Christoph Hellwig 2021-01-26 708 bio = bio_alloc(GFP_NOFS, nr_pages); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 709 if (!bio) 02ef12a663c7ac Johannes Thumshirn 2020-05-12 710 return -ENOMEM; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 711 02ef12a663c7ac Johannes Thumshirn 2020-05-12 712 bio_set_dev(bio, bdev); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 713 bio->bi_iter.bi_sector = zi->i_zsector; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 714 bio->bi_write_hint = iocb->ki_hint; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 715 bio->bi_ioprio = iocb->ki_ioprio; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 716 bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 717 if (iocb->ki_flags & IOCB_DSYNC) 02ef12a663c7ac Johannes Thumshirn 2020-05-12 718 bio->bi_opf |= REQ_FUA; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 719 02ef12a663c7ac Johannes Thumshirn 2020-05-12 720 ret = bio_iov_iter_get_pages(bio, from); 6bea0225a4bf14 Damien Le Moal 2020-12-09 721 if (unlikely(ret)) 6bea0225a4bf14 Damien Le Moal 2020-12-09 722 goto out_release; 6bea0225a4bf14 Damien Le Moal 2020-12-09 723 02ef12a663c7ac Johannes Thumshirn 2020-05-12 724 size = bio->bi_iter.bi_size; 6bea0225a4bf14 Damien Le Moal 2020-12-09 725 task_io_account_write(size); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 726 02ef12a663c7ac Johannes Thumshirn 2020-05-12 727 if (iocb->ki_flags & IOCB_HIPRI) 02ef12a663c7ac Johannes Thumshirn 2020-05-12 728 bio_set_polled(bio, iocb); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 729 02ef12a663c7ac Johannes Thumshirn 2020-05-12 730 ret = submit_bio_wait(bio); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 731 6bea0225a4bf14 Damien Le Moal 2020-12-09 @732 zonefs_file_write_dio_end_io(iocb, size, ret, 0); 62ab1aadcccd03 Johannes Thumshirn 2021-01-27 733 trace_zonefs_file_dio_append(inode, size, ret); 6bea0225a4bf14 Damien Le Moal 2020-12-09 734 6bea0225a4bf14 Damien Le Moal 2020-12-09 735 out_release: 6bea0225a4bf14 Damien Le Moal 2020-12-09 736 bio_release_pages(bio, false); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 737 bio_put(bio); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 738 02ef12a663c7ac Johannes Thumshirn 2020-05-12 739 if (ret >= 0) { 02ef12a663c7ac Johannes Thumshirn 2020-05-12 740 iocb->ki_pos += size; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 741 return size; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 742 } 02ef12a663c7ac Johannes Thumshirn 2020-05-12 743 02ef12a663c7ac Johannes Thumshirn 2020-05-12 744 return ret; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 745 } 02ef12a663c7ac Johannes Thumshirn 2020-05-12 746 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 33984 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/3] iomap: Pass original DIO size to completion handler 2021-04-12 10:23 ` [PATCH 1/3] iomap: Pass original DIO size to completion handler Jan Kara 2021-04-12 14:07 ` kernel test robot @ 2021-04-12 14:12 ` kernel test robot 2021-04-12 16:37 ` kernel test robot 2 siblings, 0 replies; 13+ messages in thread From: kernel test robot @ 2021-04-12 14:12 UTC (permalink / raw) To: Jan Kara, Ted Tso Cc: kbuild-all, linux-ext4, Eric Whitney, linux-fsdevel, Darrick J . Wong, Jan Kara [-- Attachment #1: Type: text/plain, Size: 2980 bytes --] Hi Jan, I love your patch! Yet something to improve: [auto build test ERROR on ext4/dev] [also build test ERROR on xfs-linux/for-next linus/master v5.12-rc7 next-20210409] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/Jan-Kara/ext4-Fix-data-corruption-when-extending-DIO-write-races-with-buffered-read/20210412-182524 base: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev config: um-allmodconfig (attached as .config) compiler: gcc-9 (Debian 9.3.0-22) 9.3.0 reproduce (this is a W=1 build): # https://github.com/0day-ci/linux/commit/0d289243d061378ac42188ff5079287885575bb3 git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review Jan-Kara/ext4-Fix-data-corruption-when-extending-DIO-write-races-with-buffered-read/20210412-182524 git checkout 0d289243d061378ac42188ff5079287885575bb3 # save the attached .config to linux build tree make W=1 ARCH=um If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot <lkp@intel.com> All errors (new ones prefixed by >>): cc1: warning: arch/um/include/uapi: No such file or directory [-Wmissing-include-dirs] fs/zonefs/super.c: In function 'zonefs_file_dio_append': fs/zonefs/super.c:732:2: error: too few arguments to function 'zonefs_file_write_dio_end_io' 732 | zonefs_file_write_dio_end_io(iocb, size, ret, 0); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ fs/zonefs/super.c:654:12: note: declared here 654 | static int zonefs_file_write_dio_end_io(struct kiocb *iocb, ssize_t size, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ fs/zonefs/super.c: At top level: >> fs/zonefs/super.c:961:14: error: initialization of 'int (*)(struct kiocb *, ssize_t, ssize_t, int, unsigned int)' {aka 'int (*)(struct kiocb *, long int, long int, int, unsigned int)'} from incompatible pointer type 'int (*)(struct kiocb *, ssize_t, int, unsigned int)' {aka 'int (*)(struct kiocb *, long int, int, unsigned int)'} [-Werror=incompatible-pointer-types] 961 | .end_io = zonefs_file_read_dio_end_io, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~ fs/zonefs/super.c:961:14: note: (near initialization for 'zonefs_read_dio_ops.end_io') cc1: some warnings being treated as errors vim +961 fs/zonefs/super.c 8dcc1a9d90c10fa Damien Le Moal 2019-12-25 959 8dcc1a9d90c10fa Damien Le Moal 2019-12-25 960 static const struct iomap_dio_ops zonefs_read_dio_ops = { 8dcc1a9d90c10fa Damien Le Moal 2019-12-25 @961 .end_io = zonefs_file_read_dio_end_io, 8dcc1a9d90c10fa Damien Le Moal 2019-12-25 962 }; 8dcc1a9d90c10fa Damien Le Moal 2019-12-25 963 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 24378 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/3] iomap: Pass original DIO size to completion handler 2021-04-12 10:23 ` [PATCH 1/3] iomap: Pass original DIO size to completion handler Jan Kara 2021-04-12 14:07 ` kernel test robot 2021-04-12 14:12 ` kernel test robot @ 2021-04-12 16:37 ` kernel test robot 2 siblings, 0 replies; 13+ messages in thread From: kernel test robot @ 2021-04-12 16:37 UTC (permalink / raw) To: Jan Kara, Ted Tso Cc: kbuild-all, linux-ext4, Eric Whitney, linux-fsdevel, Darrick J . Wong, Jan Kara [-- Attachment #1: Type: text/plain, Size: 7175 bytes --] Hi Jan, I love your patch! Yet something to improve: [auto build test ERROR on ext4/dev] [also build test ERROR on xfs-linux/for-next linus/master v5.12-rc7 next-20210412] [cannot apply to tytso-fscrypt/master] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/Jan-Kara/ext4-Fix-data-corruption-when-extending-DIO-write-races-with-buffered-read/20210412-182524 base: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev config: openrisc-randconfig-r032-20210412 (attached as .config) compiler: or1k-linux-gcc (GCC) 9.3.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/0day-ci/linux/commit/0d289243d061378ac42188ff5079287885575bb3 git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review Jan-Kara/ext4-Fix-data-corruption-when-extending-DIO-write-races-with-buffered-read/20210412-182524 git checkout 0d289243d061378ac42188ff5079287885575bb3 # save the attached .config to linux build tree COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=openrisc If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot <lkp@intel.com> All errors (new ones prefixed by >>): fs/zonefs/super.c: In function 'zonefs_file_dio_append': >> fs/zonefs/super.c:732:2: error: too few arguments to function 'zonefs_file_write_dio_end_io' 732 | zonefs_file_write_dio_end_io(iocb, size, ret, 0); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ fs/zonefs/super.c:654:12: note: declared here 654 | static int zonefs_file_write_dio_end_io(struct kiocb *iocb, ssize_t size, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ fs/zonefs/super.c: At top level: >> fs/zonefs/super.c:961:14: error: initialization of 'int (*)(struct kiocb *, ssize_t, ssize_t, int, unsigned int)' {aka 'int (*)(struct kiocb *, int, int, int, unsigned int)'} from incompatible pointer type 'int (*)(struct kiocb *, ssize_t, int, unsigned int)' {aka 'int (*)(struct kiocb *, int, int, unsigned int)'} [-Werror=incompatible-pointer-types] 961 | .end_io = zonefs_file_read_dio_end_io, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~ fs/zonefs/super.c:961:14: note: (near initialization for 'zonefs_read_dio_ops.end_io') cc1: some warnings being treated as errors vim +/zonefs_file_write_dio_end_io +732 fs/zonefs/super.c 8dcc1a9d90c10f Damien Le Moal 2019-12-25 688 02ef12a663c7ac Johannes Thumshirn 2020-05-12 689 static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from) 02ef12a663c7ac Johannes Thumshirn 2020-05-12 690 { 02ef12a663c7ac Johannes Thumshirn 2020-05-12 691 struct inode *inode = file_inode(iocb->ki_filp); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 692 struct zonefs_inode_info *zi = ZONEFS_I(inode); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 693 struct block_device *bdev = inode->i_sb->s_bdev; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 694 unsigned int max; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 695 struct bio *bio; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 696 ssize_t size; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 697 int nr_pages; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 698 ssize_t ret; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 699 02ef12a663c7ac Johannes Thumshirn 2020-05-12 700 max = queue_max_zone_append_sectors(bdev_get_queue(bdev)); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 701 max = ALIGN_DOWN(max << SECTOR_SHIFT, inode->i_sb->s_blocksize); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 702 iov_iter_truncate(from, max); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 703 a8affc03a9b375 Christoph Hellwig 2021-03-11 704 nr_pages = iov_iter_npages(from, BIO_MAX_VECS); 89ee72376be23a Johannes Thumshirn 2020-07-16 705 if (!nr_pages) 89ee72376be23a Johannes Thumshirn 2020-07-16 706 return 0; 89ee72376be23a Johannes Thumshirn 2020-07-16 707 f91ca2a370bec5 Christoph Hellwig 2021-01-26 708 bio = bio_alloc(GFP_NOFS, nr_pages); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 709 if (!bio) 02ef12a663c7ac Johannes Thumshirn 2020-05-12 710 return -ENOMEM; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 711 02ef12a663c7ac Johannes Thumshirn 2020-05-12 712 bio_set_dev(bio, bdev); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 713 bio->bi_iter.bi_sector = zi->i_zsector; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 714 bio->bi_write_hint = iocb->ki_hint; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 715 bio->bi_ioprio = iocb->ki_ioprio; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 716 bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 717 if (iocb->ki_flags & IOCB_DSYNC) 02ef12a663c7ac Johannes Thumshirn 2020-05-12 718 bio->bi_opf |= REQ_FUA; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 719 02ef12a663c7ac Johannes Thumshirn 2020-05-12 720 ret = bio_iov_iter_get_pages(bio, from); 6bea0225a4bf14 Damien Le Moal 2020-12-09 721 if (unlikely(ret)) 6bea0225a4bf14 Damien Le Moal 2020-12-09 722 goto out_release; 6bea0225a4bf14 Damien Le Moal 2020-12-09 723 02ef12a663c7ac Johannes Thumshirn 2020-05-12 724 size = bio->bi_iter.bi_size; 6bea0225a4bf14 Damien Le Moal 2020-12-09 725 task_io_account_write(size); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 726 02ef12a663c7ac Johannes Thumshirn 2020-05-12 727 if (iocb->ki_flags & IOCB_HIPRI) 02ef12a663c7ac Johannes Thumshirn 2020-05-12 728 bio_set_polled(bio, iocb); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 729 02ef12a663c7ac Johannes Thumshirn 2020-05-12 730 ret = submit_bio_wait(bio); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 731 6bea0225a4bf14 Damien Le Moal 2020-12-09 @732 zonefs_file_write_dio_end_io(iocb, size, ret, 0); 62ab1aadcccd03 Johannes Thumshirn 2021-01-27 733 trace_zonefs_file_dio_append(inode, size, ret); 6bea0225a4bf14 Damien Le Moal 2020-12-09 734 6bea0225a4bf14 Damien Le Moal 2020-12-09 735 out_release: 6bea0225a4bf14 Damien Le Moal 2020-12-09 736 bio_release_pages(bio, false); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 737 bio_put(bio); 02ef12a663c7ac Johannes Thumshirn 2020-05-12 738 02ef12a663c7ac Johannes Thumshirn 2020-05-12 739 if (ret >= 0) { 02ef12a663c7ac Johannes Thumshirn 2020-05-12 740 iocb->ki_pos += size; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 741 return size; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 742 } 02ef12a663c7ac Johannes Thumshirn 2020-05-12 743 02ef12a663c7ac Johannes Thumshirn 2020-05-12 744 return ret; 02ef12a663c7ac Johannes Thumshirn 2020-05-12 745 } 02ef12a663c7ac Johannes Thumshirn 2020-05-12 746 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 21556 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 2/3] ext4: Fix occasional generic/418 failure 2021-04-12 10:23 [PATCH 0/3] ext4: Fix data corruption when extending DIO write races with buffered read Jan Kara 2021-04-12 10:23 ` [PATCH 1/3] iomap: Pass original DIO size to completion handler Jan Kara @ 2021-04-12 10:23 ` Jan Kara 2021-04-12 21:50 ` Dave Chinner 2021-04-12 10:23 ` [PATCH 3/3] ext4: Fix overflow in ext4_iomap_alloc() Jan Kara 2 siblings, 1 reply; 13+ messages in thread From: Jan Kara @ 2021-04-12 10:23 UTC (permalink / raw) To: Ted Tso Cc: linux-ext4, Eric Whitney, linux-fsdevel, Darrick J . Wong, Jan Kara Eric has noticed that after pagecache read rework, generic/418 is occasionally failing for ext4 when blocksize < pagesize. In fact, the pagecache rework just made hard to hit race in ext4 more likely. The problem is that since ext4 conversion of direct IO writes to iomap framework (commit 378f32bab371), we update inode size after direct IO write only after invalidating page cache. Thus if buffered read sneaks at unfortunate moment like: CPU1 - write at offset 1k CPU2 - read from offset 0 iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT); ext4_readpage(); ext4_handle_inode_extension() the read will zero out tail of the page as it still sees smaller inode size and thus page cache becomes inconsistent with on-disk contents with all the consequences. Fix the problem by moving inode size update into end_io handler which gets called before the page cache is invalidated. Reported-by: Eric Whitney <enwlinux@gmail.com> Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure") Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext4/file.c | 185 ++++++++++++++++++++++++++----------------------- 1 file changed, 99 insertions(+), 86 deletions(-) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 2505313d96b0..ec59ea078f53 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -280,11 +280,67 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb, return ret; } +static int ext4_prepare_dio_extend(struct inode *inode) +{ + handle_t *handle; + int ret; + + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2); + if (IS_ERR(handle)) + return PTR_ERR(handle); + + ext4_fc_start_update(inode); + ret = ext4_orphan_add(handle, inode); + ext4_fc_stop_update(inode); + if (ret) { + ext4_journal_stop(handle); + return ret; + } + ext4_journal_stop(handle); + return 0; +} + +static void ext4_cleanup_dio_extend(struct inode *inode, loff_t offset, + ssize_t written, size_t count) +{ + handle_t *handle; + u8 blkbits = inode->i_blkbits; + ext4_lblk_t written_blk, end_blk; + + written_blk = ALIGN(offset + written, 1 << blkbits); + end_blk = ALIGN(offset + count, 1 << blkbits); + if (written_blk < end_blk && ext4_can_truncate(inode)) { + ext4_truncate_failed_write(inode); + /* + * If the truncate operation failed early, then the inode may + * still be on the orphan list. In that case, we need to try + * remove the inode from the in-memory linked list. + */ + if (inode->i_nlink) + ext4_orphan_del(NULL, inode); + return; + } + + /* + * We need to ensure that the inode is removed from the orphan + * list if it has been added prematurely, due to writeback of + * delalloc blocks. + */ + if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) { + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2); + if (IS_ERR(handle)) { + ext4_orphan_del(NULL, inode); + return; + } + ext4_orphan_del(handle, inode); + ext4_journal_stop(handle); + } +} + static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset, ssize_t written, size_t count) { handle_t *handle; - bool truncate = false; u8 blkbits = inode->i_blkbits; ext4_lblk_t written_blk, end_blk; int ret; @@ -298,73 +354,31 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset, * as much as we intended. */ WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize); - if (offset + count <= EXT4_I(inode)->i_disksize) { - /* - * We need to ensure that the inode is removed from the orphan - * list if it has been added prematurely, due to writeback of - * delalloc blocks. - */ - if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) { - handle = ext4_journal_start(inode, EXT4_HT_INODE, 2); - - if (IS_ERR(handle)) { - ext4_orphan_del(NULL, inode); - return PTR_ERR(handle); - } - - ext4_orphan_del(handle, inode); - ext4_journal_stop(handle); - } - + if (offset + count <= EXT4_I(inode)->i_disksize) return written; - } - - if (written < 0) - goto truncate; handle = ext4_journal_start(inode, EXT4_HT_INODE, 2); - if (IS_ERR(handle)) { - written = PTR_ERR(handle); - goto truncate; - } + if (IS_ERR(handle)) + return PTR_ERR(handle); if (ext4_update_inode_size(inode, offset + written)) { ret = ext4_mark_inode_dirty(handle, inode); if (unlikely(ret)) { - written = ret; ext4_journal_stop(handle); - goto truncate; + return ret; } } /* - * We may need to truncate allocated but not written blocks beyond EOF. + * If there cannot be allocated but unused blocks beyond EOF, remove + * the inode from the orphan list as we are done with it. */ written_blk = ALIGN(offset + written, 1 << blkbits); end_blk = ALIGN(offset + count, 1 << blkbits); - if (written_blk < end_blk && ext4_can_truncate(inode)) - truncate = true; - - /* - * Remove the inode from the orphan list if it has been extended and - * everything went OK. - */ - if (!truncate && inode->i_nlink) + if (written_blk == end_blk && inode->i_nlink) ext4_orphan_del(handle, inode); ext4_journal_stop(handle); - if (truncate) { -truncate: - ext4_truncate_failed_write(inode); - /* - * If the truncate operation failed early, then the inode may - * still be on the orphan list. In that case, we need to try - * remove the inode from the in-memory linked list. - */ - if (inode->i_nlink) - ext4_orphan_del(NULL, inode); - } - return written; } @@ -385,10 +399,28 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, return 0; } +static int ext4_dio_extending_write_end_io(struct kiocb *iocb, ssize_t size, + ssize_t orig_size, int error, + unsigned int flags) +{ + struct inode *inode = file_inode(iocb->ki_filp); + int ret; + + ret = ext4_dio_write_end_io(iocb, size, orig_size, error, flags); + if (ret < 0) + return ret; + return ext4_handle_inode_extension(inode, iocb->ki_pos, size, + orig_size); +} + static const struct iomap_dio_ops ext4_dio_write_ops = { .end_io = ext4_dio_write_end_io, }; +static const struct iomap_dio_ops ext4_dio_extending_write_ops = { + .end_io = ext4_dio_extending_write_end_io, +}; + /* * The intention here is to start with shared lock acquired then see if any * condition requires an exclusive inode lock. If yes, then we restart the @@ -455,11 +487,11 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from, static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from) { ssize_t ret; - handle_t *handle; struct inode *inode = file_inode(iocb->ki_filp); loff_t offset = iocb->ki_pos; size_t count = iov_iter_count(from); const struct iomap_ops *iomap_ops = &ext4_iomap_ops; + const struct iomap_dio_ops *dio_ops = &ext4_dio_write_ops; bool extend = false, unaligned_io = false; bool ilock_shared = true; @@ -530,33 +562,21 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from) inode_dio_wait(inode); if (extend) { - handle = ext4_journal_start(inode, EXT4_HT_INODE, 2); - if (IS_ERR(handle)) { - ret = PTR_ERR(handle); + ret = ext4_prepare_dio_extend(inode); + if (ret) goto out; - } - - ext4_fc_start_update(inode); - ret = ext4_orphan_add(handle, inode); - ext4_fc_stop_update(inode); - if (ret) { - ext4_journal_stop(handle); - goto out; - } - - ext4_journal_stop(handle); } if (ilock_shared) iomap_ops = &ext4_iomap_overwrite_ops; - ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops, - (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0); - if (ret == -ENOTBLK) - ret = 0; - if (extend) - ret = ext4_handle_inode_extension(inode, offset, ret, count); + dio_ops = &ext4_dio_extending_write_ops; + ret = iomap_dio_rw(iocb, from, iomap_ops, dio_ops, + (extend || unaligned_io) ? IOMAP_DIO_FORCE_WAIT : 0); + if (ret == -ENOTBLK) + ret = 0; + ext4_cleanup_dio_extend(inode, offset, ret, count); out: if (ilock_shared) inode_unlock_shared(inode); @@ -599,7 +619,6 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from) ssize_t ret; size_t count; loff_t offset; - handle_t *handle; bool extend = false; struct inode *inode = file_inode(iocb->ki_filp); @@ -618,26 +637,20 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from) count = iov_iter_count(from); if (offset + count > EXT4_I(inode)->i_disksize) { - handle = ext4_journal_start(inode, EXT4_HT_INODE, 2); - if (IS_ERR(handle)) { - ret = PTR_ERR(handle); - goto out; - } - - ret = ext4_orphan_add(handle, inode); - if (ret) { - ext4_journal_stop(handle); + ret = ext4_prepare_dio_extend(inode); + if (ret < 0) goto out; - } - extend = true; - ext4_journal_stop(handle); } ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops); - if (extend) - ret = ext4_handle_inode_extension(inode, offset, ret, count); + if (extend) { + if (ret > 0) + ret = ext4_handle_inode_extension(inode, offset, ret, + count); + ext4_cleanup_dio_extend(inode, offset, ret, count); + } out: inode_unlock(inode); if (ret > 0) -- 2.26.2 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 2/3] ext4: Fix occasional generic/418 failure 2021-04-12 10:23 ` [PATCH 2/3] ext4: Fix occasional generic/418 failure Jan Kara @ 2021-04-12 21:50 ` Dave Chinner 2021-04-13 9:11 ` Jan Kara 0 siblings, 1 reply; 13+ messages in thread From: Dave Chinner @ 2021-04-12 21:50 UTC (permalink / raw) To: Jan Kara Cc: Ted Tso, linux-ext4, Eric Whitney, linux-fsdevel, Darrick J . Wong On Mon, Apr 12, 2021 at 12:23:32PM +0200, Jan Kara wrote: > Eric has noticed that after pagecache read rework, generic/418 is > occasionally failing for ext4 when blocksize < pagesize. In fact, the > pagecache rework just made hard to hit race in ext4 more likely. The > problem is that since ext4 conversion of direct IO writes to iomap > framework (commit 378f32bab371), we update inode size after direct IO > write only after invalidating page cache. Thus if buffered read sneaks > at unfortunate moment like: > > CPU1 - write at offset 1k CPU2 - read from offset 0 > iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT); > ext4_readpage(); > ext4_handle_inode_extension() > > the read will zero out tail of the page as it still sees smaller inode > size and thus page cache becomes inconsistent with on-disk contents with > all the consequences. > > Fix the problem by moving inode size update into end_io handler which > gets called before the page cache is invalidated. Confused. This moves all the inode extension stuff into the completion handler, when all that really needs to be done is extending inode->i_size to tell the world there is data up to where the IO completed. Actually removing the inode from the orphan list does not need to be done in the IO completion callback, because... > if (ilock_shared) > iomap_ops = &ext4_iomap_overwrite_ops; > - ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops, > - (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0); > - if (ret == -ENOTBLK) > - ret = 0; > - > if (extend) > - ret = ext4_handle_inode_extension(inode, offset, ret, count); > + dio_ops = &ext4_dio_extending_write_ops; > > + ret = iomap_dio_rw(iocb, from, iomap_ops, dio_ops, > + (extend || unaligned_io) ? IOMAP_DIO_FORCE_WAIT : 0); ^^^^^^ ^^^^^^^^^^^^^^^^^^^ .... if we are doing an extending write, we force DIO to complete before returning. Hence even AIO will block here on an extending write, and hence we can -always- do the correct post-IO completion orphan list cleanup here because we know a) the original IO size and b) the amount of data that was actually written. Hence all that remains is closing the buffered read vs invalidation race. All this requires is for the dio write completion to behave like XFS where it just does the inode->i_size update for extending writes. THis means the size is updated before the invalidation, and hence any read that occurs after the invalidation but before the post-eof blocks have been removed will see the correct size and read the tail page(s) correctly. This closes the race window, and the caller can still handle the post-eof block cleanup as it does now. Hence I don't see any need for changing the iomap infrastructure to solve this problem. This seems like the obvious solution to me, so what am I missing? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/3] ext4: Fix occasional generic/418 failure 2021-04-12 21:50 ` Dave Chinner @ 2021-04-13 9:11 ` Jan Kara 2021-04-13 22:45 ` Dave Chinner 0 siblings, 1 reply; 13+ messages in thread From: Jan Kara @ 2021-04-13 9:11 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Ted Tso, linux-ext4, Eric Whitney, linux-fsdevel, Darrick J . Wong On Tue 13-04-21 07:50:24, Dave Chinner wrote: > On Mon, Apr 12, 2021 at 12:23:32PM +0200, Jan Kara wrote: > > Eric has noticed that after pagecache read rework, generic/418 is > > occasionally failing for ext4 when blocksize < pagesize. In fact, the > > pagecache rework just made hard to hit race in ext4 more likely. The > > problem is that since ext4 conversion of direct IO writes to iomap > > framework (commit 378f32bab371), we update inode size after direct IO > > write only after invalidating page cache. Thus if buffered read sneaks > > at unfortunate moment like: > > > > CPU1 - write at offset 1k CPU2 - read from offset 0 > > iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT); > > ext4_readpage(); > > ext4_handle_inode_extension() > > > > the read will zero out tail of the page as it still sees smaller inode > > size and thus page cache becomes inconsistent with on-disk contents with > > all the consequences. > > > > Fix the problem by moving inode size update into end_io handler which > > gets called before the page cache is invalidated. > > Confused. > > This moves all the inode extension stuff into the completion > handler, when all that really needs to be done is extending > inode->i_size to tell the world there is data up to where the > IO completed. Actually removing the inode from the orphan list > does not need to be done in the IO completion callback, because... > > > if (ilock_shared) > > iomap_ops = &ext4_iomap_overwrite_ops; > > - ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops, > > - (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0); > > - if (ret == -ENOTBLK) > > - ret = 0; > > - > > if (extend) > > - ret = ext4_handle_inode_extension(inode, offset, ret, count); > > + dio_ops = &ext4_dio_extending_write_ops; > > > > + ret = iomap_dio_rw(iocb, from, iomap_ops, dio_ops, > > + (extend || unaligned_io) ? IOMAP_DIO_FORCE_WAIT : 0); > ^^^^^^ ^^^^^^^^^^^^^^^^^^^ > > .... if we are doing an extending write, we force DIO to complete > before returning. Hence even AIO will block here on an extending > write, and hence we can -always- do the correct post-IO completion > orphan list cleanup here because we know a) the original IO size and > b) the amount of data that was actually written. > > Hence all that remains is closing the buffered read vs invalidation > race. All this requires is for the dio write completion to behave > like XFS where it just does the inode->i_size update for extending > writes. THis means the size is updated before the invalidation, and > hence any read that occurs after the invalidation but before the > post-eof blocks have been removed will see the correct size and read > the tail page(s) correctly. This closes the race window, and the > caller can still handle the post-eof block cleanup as it does now. > > Hence I don't see any need for changing the iomap infrastructure to > solve this problem. This seems like the obvious solution to me, so > what am I missing? All that you write above is correct. The missing piece is: If everything succeeded and all the cleanup we need is removing inode from the orphan list (common case), we want to piggyback that orphan list removal into the same transaction handle as the update of the inode size. This is just a performance thing, you are absolutely right we could also do the orphan cleanup unconditionally in ext4_dio_write_iter() and thus avoid any changes to the iomap framework. OK, now that I write about this, maybe I was just too hung up on the performance improvement. Probably a better way forward is that I just fix the data corruption bug only inside ext4 (that will be also much easier to backport) and then submit the performance improvement modifying iomap if I can actually get performance data justifying it. Thanks for poking into this :) Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/3] ext4: Fix occasional generic/418 failure 2021-04-13 9:11 ` Jan Kara @ 2021-04-13 22:45 ` Dave Chinner 2021-04-14 11:56 ` Jan Kara 0 siblings, 1 reply; 13+ messages in thread From: Dave Chinner @ 2021-04-13 22:45 UTC (permalink / raw) To: Jan Kara Cc: Ted Tso, linux-ext4, Eric Whitney, linux-fsdevel, Darrick J . Wong On Tue, Apr 13, 2021 at 11:11:22AM +0200, Jan Kara wrote: > On Tue 13-04-21 07:50:24, Dave Chinner wrote: > > On Mon, Apr 12, 2021 at 12:23:32PM +0200, Jan Kara wrote: > > > Eric has noticed that after pagecache read rework, generic/418 is > > > occasionally failing for ext4 when blocksize < pagesize. In fact, the > > > pagecache rework just made hard to hit race in ext4 more likely. The > > > problem is that since ext4 conversion of direct IO writes to iomap > > > framework (commit 378f32bab371), we update inode size after direct IO > > > write only after invalidating page cache. Thus if buffered read sneaks > > > at unfortunate moment like: > > > > > > CPU1 - write at offset 1k CPU2 - read from offset 0 > > > iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT); > > > ext4_readpage(); > > > ext4_handle_inode_extension() > > > > > > the read will zero out tail of the page as it still sees smaller inode > > > size and thus page cache becomes inconsistent with on-disk contents with > > > all the consequences. > > > > > > Fix the problem by moving inode size update into end_io handler which > > > gets called before the page cache is invalidated. > > > > Confused. > > > > This moves all the inode extension stuff into the completion > > handler, when all that really needs to be done is extending > > inode->i_size to tell the world there is data up to where the > > IO completed. Actually removing the inode from the orphan list > > does not need to be done in the IO completion callback, because... > > > > > if (ilock_shared) > > > iomap_ops = &ext4_iomap_overwrite_ops; > > > - ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops, > > > - (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0); > > > - if (ret == -ENOTBLK) > > > - ret = 0; > > > - > > > if (extend) > > > - ret = ext4_handle_inode_extension(inode, offset, ret, count); > > > + dio_ops = &ext4_dio_extending_write_ops; > > > > > > + ret = iomap_dio_rw(iocb, from, iomap_ops, dio_ops, > > > + (extend || unaligned_io) ? IOMAP_DIO_FORCE_WAIT : 0); > > ^^^^^^ ^^^^^^^^^^^^^^^^^^^ > > > > .... if we are doing an extending write, we force DIO to complete > > before returning. Hence even AIO will block here on an extending > > write, and hence we can -always- do the correct post-IO completion > > orphan list cleanup here because we know a) the original IO size and > > b) the amount of data that was actually written. > > > > Hence all that remains is closing the buffered read vs invalidation > > race. All this requires is for the dio write completion to behave > > like XFS where it just does the inode->i_size update for extending > > writes. THis means the size is updated before the invalidation, and > > hence any read that occurs after the invalidation but before the > > post-eof blocks have been removed will see the correct size and read > > the tail page(s) correctly. This closes the race window, and the > > caller can still handle the post-eof block cleanup as it does now. > > > > Hence I don't see any need for changing the iomap infrastructure to > > solve this problem. This seems like the obvious solution to me, so > > what am I missing? > > All that you write above is correct. The missing piece is: If everything > succeeded and all the cleanup we need is removing inode from the orphan > list (common case), we want to piggyback that orphan list removal into the > same transaction handle as the update of the inode size. This is just a > performance thing, you are absolutely right we could also do the orphan > cleanup unconditionally in ext4_dio_write_iter() and thus avoid any changes > to the iomap framework. Doesn't ext4, like XFS, keep two copies of the inode size? One for the on-disk size and one for the in-memory size? /me looks... Yeah, there's ei->i_disksize that reflects the on-disk size. And I note that the first thing that ext4_handle_inode_extension() is already checking that the write is extending past the current on-disk inode size before running the extension transaction. The page cache only cares about the inode->i_size value, not the ei->i_disksize value, so you can update them independently and still have things work correctly. That's what XFS does in xfs_dio_write_end_io - it updates the in-memory inode->i_size, then runs a transaction to atomically update the inode on-disk inode size. Updating the VFS inode size first protects against buffered read races while updating the on-disk size... So for ext4, the two separate size updates don't need to be done at the same time - you have all the state you need in the ext4 dio write path to extend the on-disk file size on successful extending write, and it is not dependent in any way on the current in-memory VFS inode size that you'd update in the ->end_io callback.... > OK, now that I write about this, maybe I was just too hung up on the > performance improvement. Probably a better way forward is that I just fix > the data corruption bug only inside ext4 (that will be also much easier to > backport) and then submit the performance improvement modifying iomap if I > can actually get performance data justifying it. Thanks for poking into > this :) Sounds like a good plan :) Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/3] ext4: Fix occasional generic/418 failure 2021-04-13 22:45 ` Dave Chinner @ 2021-04-14 11:56 ` Jan Kara 0 siblings, 0 replies; 13+ messages in thread From: Jan Kara @ 2021-04-14 11:56 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Ted Tso, linux-ext4, Eric Whitney, linux-fsdevel, Darrick J . Wong On Wed 14-04-21 08:45:31, Dave Chinner wrote: > On Tue, Apr 13, 2021 at 11:11:22AM +0200, Jan Kara wrote: > > On Tue 13-04-21 07:50:24, Dave Chinner wrote: > > > On Mon, Apr 12, 2021 at 12:23:32PM +0200, Jan Kara wrote: > > > > Eric has noticed that after pagecache read rework, generic/418 is > > > > occasionally failing for ext4 when blocksize < pagesize. In fact, the > > > > pagecache rework just made hard to hit race in ext4 more likely. The > > > > problem is that since ext4 conversion of direct IO writes to iomap > > > > framework (commit 378f32bab371), we update inode size after direct IO > > > > write only after invalidating page cache. Thus if buffered read sneaks > > > > at unfortunate moment like: > > > > > > > > CPU1 - write at offset 1k CPU2 - read from offset 0 > > > > iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT); > > > > ext4_readpage(); > > > > ext4_handle_inode_extension() > > > > > > > > the read will zero out tail of the page as it still sees smaller inode > > > > size and thus page cache becomes inconsistent with on-disk contents with > > > > all the consequences. > > > > > > > > Fix the problem by moving inode size update into end_io handler which > > > > gets called before the page cache is invalidated. > > > > > > Confused. > > > > > > This moves all the inode extension stuff into the completion > > > handler, when all that really needs to be done is extending > > > inode->i_size to tell the world there is data up to where the > > > IO completed. Actually removing the inode from the orphan list > > > does not need to be done in the IO completion callback, because... > > > > > > > if (ilock_shared) > > > > iomap_ops = &ext4_iomap_overwrite_ops; > > > > - ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops, > > > > - (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0); > > > > - if (ret == -ENOTBLK) > > > > - ret = 0; > > > > - > > > > if (extend) > > > > - ret = ext4_handle_inode_extension(inode, offset, ret, count); > > > > + dio_ops = &ext4_dio_extending_write_ops; > > > > > > > > + ret = iomap_dio_rw(iocb, from, iomap_ops, dio_ops, > > > > + (extend || unaligned_io) ? IOMAP_DIO_FORCE_WAIT : 0); > > > ^^^^^^ ^^^^^^^^^^^^^^^^^^^ > > > > > > .... if we are doing an extending write, we force DIO to complete > > > before returning. Hence even AIO will block here on an extending > > > write, and hence we can -always- do the correct post-IO completion > > > orphan list cleanup here because we know a) the original IO size and > > > b) the amount of data that was actually written. > > > > > > Hence all that remains is closing the buffered read vs invalidation > > > race. All this requires is for the dio write completion to behave > > > like XFS where it just does the inode->i_size update for extending > > > writes. THis means the size is updated before the invalidation, and > > > hence any read that occurs after the invalidation but before the > > > post-eof blocks have been removed will see the correct size and read > > > the tail page(s) correctly. This closes the race window, and the > > > caller can still handle the post-eof block cleanup as it does now. > > > > > > Hence I don't see any need for changing the iomap infrastructure to > > > solve this problem. This seems like the obvious solution to me, so > > > what am I missing? > > > > All that you write above is correct. The missing piece is: If everything > > succeeded and all the cleanup we need is removing inode from the orphan > > list (common case), we want to piggyback that orphan list removal into the > > same transaction handle as the update of the inode size. This is just a > > performance thing, you are absolutely right we could also do the orphan > > cleanup unconditionally in ext4_dio_write_iter() and thus avoid any changes > > to the iomap framework. > > Doesn't ext4, like XFS, keep two copies of the inode size? One for > the on-disk size and one for the in-memory size? > > /me looks... > > Yeah, there's ei->i_disksize that reflects the on-disk size. > > And I note that the first thing that ext4_handle_inode_extension() > is already checking that the write is extending past the current > on-disk inode size before running the extension transaction. Yes. > The page cache only cares about the inode->i_size value, not the > ei->i_disksize value, so you can update them independently and still > have things work correctly. That's what XFS does in > xfs_dio_write_end_io - it updates the in-memory inode->i_size, then > runs a transaction to atomically update the inode on-disk inode > size. Updating the VFS inode size first protects against buffered > read races while updating the on-disk size... > > So for ext4, the two separate size updates don't need to be done at > the same time - you have all the state you need in the ext4 dio > write path to extend the on-disk file size on successful extending > write, and it is not dependent in any way on the current in-memory > VFS inode size that you'd update in the ->end_io callback.... Right, that's a nice trick that will allow us to keep the original performance (I've verified that indeed splitting inode size and orphan updates into separate transactions costs us ~10% of performance when appending 512-byte chunks) without touching generic dio code. Thanks for the idea! Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 3/3] ext4: Fix overflow in ext4_iomap_alloc() 2021-04-12 10:23 [PATCH 0/3] ext4: Fix data corruption when extending DIO write races with buffered read Jan Kara 2021-04-12 10:23 ` [PATCH 1/3] iomap: Pass original DIO size to completion handler Jan Kara 2021-04-12 10:23 ` [PATCH 2/3] ext4: Fix occasional generic/418 failure Jan Kara @ 2021-04-12 10:23 ` Jan Kara 2021-04-12 11:30 ` Matthew Wilcox 2021-06-16 12:22 ` Theodore Ts'o 2 siblings, 2 replies; 13+ messages in thread From: Jan Kara @ 2021-04-12 10:23 UTC (permalink / raw) To: Ted Tso Cc: linux-ext4, Eric Whitney, linux-fsdevel, Darrick J . Wong, Jan Kara A code in iomap alloc may overblock block number when converting it to byte offset. Luckily this is mostly harmless as we will just use more expensive method of writing using unwritten extents even though we are writing beyond i_size. Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure") Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext4/inode.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 0948a43f1b3d..7cebbb2d2e34 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3420,7 +3420,7 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map, * i_disksize out to i_size. This could be beyond where direct I/O is * happening and thus expose allocated blocks to direct I/O reads. */ - else if ((map->m_lblk * (1 << blkbits)) >= i_size_read(inode)) + else if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode)) m_flags = EXT4_GET_BLOCKS_CREATE; else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT; -- 2.26.2 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 3/3] ext4: Fix overflow in ext4_iomap_alloc() 2021-04-12 10:23 ` [PATCH 3/3] ext4: Fix overflow in ext4_iomap_alloc() Jan Kara @ 2021-04-12 11:30 ` Matthew Wilcox 2021-06-16 12:22 ` Theodore Ts'o 1 sibling, 0 replies; 13+ messages in thread From: Matthew Wilcox @ 2021-04-12 11:30 UTC (permalink / raw) To: Jan Kara Cc: Ted Tso, linux-ext4, Eric Whitney, linux-fsdevel, Darrick J . Wong On Mon, Apr 12, 2021 at 12:23:33PM +0200, Jan Kara wrote: > A code in iomap alloc may overblock block number when converting it to "overflow"? > byte offset. Luckily this is mostly harmless as we will just use more > expensive method of writing using unwritten extents even though we are > writing beyond i_size. > > Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure") > Signed-off-by: Jan Kara <jack@suse.cz> > --- > fs/ext4/inode.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 0948a43f1b3d..7cebbb2d2e34 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -3420,7 +3420,7 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map, > * i_disksize out to i_size. This could be beyond where direct I/O is > * happening and thus expose allocated blocks to direct I/O reads. > */ > - else if ((map->m_lblk * (1 << blkbits)) >= i_size_read(inode)) > + else if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode)) > m_flags = EXT4_GET_BLOCKS_CREATE; > else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) > m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT; > -- > 2.26.2 > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 3/3] ext4: Fix overflow in ext4_iomap_alloc() 2021-04-12 10:23 ` [PATCH 3/3] ext4: Fix overflow in ext4_iomap_alloc() Jan Kara 2021-04-12 11:30 ` Matthew Wilcox @ 2021-06-16 12:22 ` Theodore Ts'o 1 sibling, 0 replies; 13+ messages in thread From: Theodore Ts'o @ 2021-06-16 12:22 UTC (permalink / raw) To: Jan Kara; +Cc: linux-ext4, Eric Whitney, linux-fsdevel, Darrick J . Wong On Mon, Apr 12, 2021 at 12:23:33PM +0200, Jan Kara wrote: > A code in iomap alloc may overblock block number when converting it to > byte offset. Luckily this is mostly harmless as we will just use more > expensive method of writing using unwritten extents even though we are > writing beyond i_size. > > Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure") > Signed-off-by: Jan Kara <jack@suse.cz> This was part of the patch series "ext4: Fix data corruption when extending DIO write races with buffered read" but which fixes an unrelated problem. The patch series was dropped in favor of a different approach, but it looks like this patch is still applicable, so I've applied with a minor typo fix in the commit description. - Ted ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2021-06-16 12:22 UTC | newest] Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-04-12 10:23 [PATCH 0/3] ext4: Fix data corruption when extending DIO write races with buffered read Jan Kara 2021-04-12 10:23 ` [PATCH 1/3] iomap: Pass original DIO size to completion handler Jan Kara 2021-04-12 14:07 ` kernel test robot 2021-04-12 14:12 ` kernel test robot 2021-04-12 16:37 ` kernel test robot 2021-04-12 10:23 ` [PATCH 2/3] ext4: Fix occasional generic/418 failure Jan Kara 2021-04-12 21:50 ` Dave Chinner 2021-04-13 9:11 ` Jan Kara 2021-04-13 22:45 ` Dave Chinner 2021-04-14 11:56 ` Jan Kara 2021-04-12 10:23 ` [PATCH 3/3] ext4: Fix overflow in ext4_iomap_alloc() Jan Kara 2021-04-12 11:30 ` Matthew Wilcox 2021-06-16 12:22 ` Theodore Ts'o
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).